Words Are All You Need?: Using CLIP Images and Captions to Predict fMRI Responses from the Algonauts Challenge 2023¶

Cognitive AI and Neuroimaging, 2023¶

Authors: Luke Korthals, Luca Thoms, Lucija Blaževski, Arne John, Mirthe van Dijk¶

Disclaimer¶

Participation name on the Algonatus leaderboard: "LukeK" (Score: 50.5408)

We ran and tested all models locally on a single AMD GPU. Furthermore, any additional data used (on top of the challenge data) as well as this notebook can be found in our GitHub repository (e.g., CSV file encoding the COCO ID, NSD ID, and caption matches)

Link: https://github.com/lukekorthals/algonauts-2023

Abstract¶

Recent advances in machine learning have enabled the development of computational models that can predict brain responses to visual stimuli. In this study, we investigate the effectiveness of the CLIP (Contrastive Language-Image Pre-Training) model. This is a state-of-the-art neural network model developed by OpenAI to predict brain activation across the whole brain in response to image stimuli in the Algonauts 2023 Challenge. We proposed three CLIP-based solutions: a Text model, a Vision model, and a combination of features from the Text and Vision models. We hypothesized that a combination of features from the Text and Vision models would provide the best approximation of the brain responses. The data consists of fMRI recordings from participants viewing complex natural visual scenes. This combination of features will be able to effectively recognize objects and infer relationships between them by leveraging its cross-modal understanding. The results showed that the Text model was the worst performing model to predict brain activation, while the Vision model outperformed the other models. Unexpectedly, the Combination model performed equal or worse than the Vision model. As our final solution to the Algonauts 2023 Challenge, we propose combining layer 4 and 8 of the CLIP Vision model. Our findings suggest that the CLIP Vision model provides a more effective approach for predicting brain activation in response to visual stimuli than either the Text model or Combination model, particulary for the early layers.

Introduction¶

Context¶

The Algonaut Challenge 2023 is established to investigate how well current computational models are doing. It is an open challenge to stimulate the combination of the biological and artificial intelligence fields. Participants are expected to build computational models to predict brain responses for images from which brain data was held out. In this report we describe what computational model we propose to predict brain data.

Our Solution: CLIP¶

We propose the use of CLIP (Contrastive Language-Image Pre-Training, Radford et al., 2021), which is a state-of-the-art neural network model developed by OpenAI that can understand the relationship between images and text. It is based on a transformer architecture, similar to those used in language models like GPT-3, and is pre-trained on a large amount of text and images in an unsupervised way. What sets CLIP apart from other models is its ability to learn cross-modal representations of images and text, allowing it to form the relationship between the two modalities. This is achieved by using a Vision Transformer, which splits an image into patches that are then linearly embedded and these vectors are then given into a standard transformer encoder (For more details, see Dosovitskiy et al., 2021). Additionally, its contrastive learning objective allows the model to associate text and images that refer to the same concepts. For example, the model learns that an image of a dog and a caption that says "a dog running in a park" is related to the concept of a dog.

Challenge Data and Motivation for Using CLIP¶

The Algonauts 2023 Challenge data comes from the Natural Scenes Dataset (NSD, Allen et al., 2022), a massive 8-subjects dataset of 7T fMRI responses to images of natural scenes coming from the COCO database (Lin et al., 2014). The COCO dataset contains a large number of images labeled with object categories, captions, and annotations. As previously mentioned, the key advantage of CLIP is its ability to learn cross-modal representations of images and text, allowing it to understand the relationship between the two modalities (Radford et al., 2021). This is particularly relevant for the COCO dataset, which contains both visual and textual information, such as image captions and object annotations. By leveraging this cross-modal understanding, CLIP can more effectively learn to recognize concepts and infer relationships between them. Given that the Challenge evaluation score is computed over the whole visual brain, we hypothesized that a combined computational model, in which visual properties interact with more abstract semantic information, offers the best solution for this challenge. Also, because CLIP has been trained on a large amount of text and images in an unsupervised way, it can generalize to new and unseen data. This is crucial in Algonauts Challenge 2023, where the goal is to build a model that can understand language and vision in a way that allows it to predict new contexts and tasks. CLIP's pre-training on large amounts of data also means that it can be fine-tuned to smaller datasets, like those used in the Algonauts challenge, to further improve its performance.

Goal¶

Here, we propose a CLIP model that uses a ViT-B/32 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder as a solution to the Algonauts 2023 challenge. To this extend, we created three models: (1) a CLIP vision model, (2) a CLIP Text model, and (3) a combination of features from the CLIP Text and CLIP Vision models.

Hypothesis¶

As recently suggested by Marjieh et al. (2023) in the field of similarity judgments, stacking the representations of language-based models and models that do not rely on texts (including image-based models) is consistently providing the best approximation for human similarity judgements. We believed that similarity judgments can be applied to viewing complex natural visual scenes. Furthermore, they found that stacking the representations from CLIP image and CLIP text models outperforms the individual models. Therefore, we hypothesized that a combination of features from the CLIP Text and CLIP Vision models provide the best approximation of brain responses recorded of participants viewing complex natural visual scenes.

In the following we will load packages and functions to assist in answering the research question.

Initialization¶

Importing all required packages and models to run this notebook.

In [1]:
# Import required packages
import os
import glob
import time
import numpy as np
import pandas as pd
import torch
import torch_directml
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
from transformers import AutoProcessor, CLIPTextModel, CLIPVisionModel, PreTrainedModel
from torch.utils.data import Dataset, DataLoader
from sklearn.decomposition import IncrementalPCA, PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import euclidean_distances
from scipy.stats import pearsonr as corr
from scipy.stats import spearmanr
from scipy.spatial.distance import squareform
from tqdm import tqdm
from typing import List, Dict, Tuple
from nilearn import datasets
from nilearn import plotting
import warnings
warnings.filterwarnings("ignore") # to reduce unnecessary output
c:\Users\luke-\Desktop\Python Repositories\algonauts-2023\.conda\lib\site-packages\tqdm\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

We used an AMD RX 5700XT GPU and AMD Ryzen 5 3600XT CPU to run this code. To enable GPU support we relied on torch_directml == 0.1.13.1.dev230413.

If you want to run this notebook on a cuda device, set AMD to False.

In [2]:
# Setup cuda device
global device
AMD = True
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if AMD:
    device = torch_directml.device()

First we define the models we are going to use in this notebook. This is necessary, as we will use some of the functionalities throughout the notebook. We downloaded pretrained CLIPModels and the CLIPProcessor from huggingface.

In [3]:
# Defining Models
global vis_model, txt_model, processor
vis_model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
txt_model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
vis_model.eval()
txt_model.eval()
Some weights of the model checkpoint at openai/clip-vit-base-patch32 were not used when initializing CLIPVisionModel: ['text_model.encoder.layers.3.self_attn.out_proj.weight', 'text_model.encoder.layers.11.self_attn.out_proj.bias', 'text_model.encoder.layers.1.self_attn.q_proj.bias', 'text_model.encoder.layers.10.layer_norm2.bias', 'text_model.encoder.layers.0.mlp.fc2.bias', 'visual_projection.weight', 'text_model.encoder.layers.2.layer_norm2.bias', 'text_model.encoder.layers.3.self_attn.out_proj.bias', 'text_model.encoder.layers.2.layer_norm1.weight', 'text_model.encoder.layers.4.layer_norm1.bias', 'text_model.encoder.layers.2.layer_norm1.bias', 'text_model.encoder.layers.0.self_attn.q_proj.weight', 'text_model.encoder.layers.4.layer_norm2.bias', 'text_model.encoder.layers.6.self_attn.out_proj.weight', 'text_model.encoder.layers.7.mlp.fc1.weight', 'text_model.encoder.layers.10.layer_norm2.weight', 'text_model.encoder.layers.6.mlp.fc1.bias', 'text_model.encoder.layers.9.layer_norm2.weight', 'text_model.encoder.layers.6.self_attn.q_proj.bias', 'text_model.encoder.layers.4.self_attn.k_proj.bias', 'text_model.encoder.layers.7.self_attn.out_proj.weight', 'text_model.encoder.layers.5.layer_norm1.weight', 'text_model.encoder.layers.0.self_attn.k_proj.bias', 'text_model.encoder.layers.4.self_attn.q_proj.weight', 'text_model.encoder.layers.10.mlp.fc1.bias', 'text_model.encoder.layers.3.mlp.fc2.bias', 'text_model.encoder.layers.9.mlp.fc1.bias', 'text_model.encoder.layers.2.self_attn.q_proj.bias', 'text_model.encoder.layers.4.self_attn.q_proj.bias', 'text_model.encoder.layers.7.self_attn.k_proj.bias', 'text_model.encoder.layers.6.mlp.fc1.weight', 'text_model.encoder.layers.5.layer_norm2.bias', 'text_model.encoder.layers.1.self_attn.k_proj.weight', 'text_model.encoder.layers.9.self_attn.k_proj.bias', 'text_model.encoder.layers.3.self_attn.q_proj.bias', 'text_model.encoder.layers.2.mlp.fc1.weight', 'text_model.encoder.layers.4.mlp.fc1.weight', 'text_model.encoder.layers.8.self_attn.k_proj.bias', 'text_model.encoder.layers.7.self_attn.q_proj.weight', 'text_model.encoder.layers.9.mlp.fc2.weight', 'text_model.encoder.layers.8.self_attn.out_proj.weight', 'text_model.encoder.layers.1.self_attn.v_proj.bias', 'text_model.encoder.layers.9.layer_norm2.bias', 'text_model.encoder.layers.8.mlp.fc2.bias', 'text_model.encoder.layers.10.layer_norm1.weight', 'text_model.encoder.layers.3.mlp.fc1.bias', 'text_model.encoder.layers.2.self_attn.q_proj.weight', 'text_model.encoder.layers.8.layer_norm2.bias', 'text_model.encoder.layers.2.self_attn.k_proj.bias', 'text_model.encoder.layers.3.layer_norm2.bias', 'text_model.encoder.layers.8.mlp.fc1.bias', 'text_model.encoder.layers.8.self_attn.v_proj.bias', 'text_model.encoder.layers.7.mlp.fc2.bias', 'text_model.encoder.layers.1.mlp.fc2.weight', 'text_model.encoder.layers.4.mlp.fc1.bias', 'text_model.encoder.layers.6.mlp.fc2.bias', 'text_model.encoder.layers.3.self_attn.q_proj.weight', 'text_model.encoder.layers.11.self_attn.q_proj.weight', 'text_model.encoder.layers.0.self_attn.out_proj.bias', 'text_model.encoder.layers.10.self_attn.v_proj.bias', 'text_model.encoder.layers.6.layer_norm2.bias', 'text_model.encoder.layers.9.self_attn.q_proj.weight', 'text_model.encoder.layers.2.mlp.fc1.bias', 'text_model.embeddings.token_embedding.weight', 'text_model.final_layer_norm.bias', 'text_model.encoder.layers.10.layer_norm1.bias', 'text_model.embeddings.position_embedding.weight', 'text_model.encoder.layers.2.mlp.fc2.weight', 'text_model.encoder.layers.0.layer_norm1.bias', 'text_model.encoder.layers.1.self_attn.v_proj.weight', 'text_model.encoder.layers.3.layer_norm1.bias', 'text_model.encoder.layers.7.self_attn.q_proj.bias', 'text_model.encoder.layers.11.layer_norm2.bias', 'text_model.encoder.layers.1.mlp.fc1.bias', 'text_model.encoder.layers.10.self_attn.out_proj.weight', 'text_model.encoder.layers.4.self_attn.k_proj.weight', 'text_model.encoder.layers.8.layer_norm1.bias', 'text_model.encoder.layers.5.mlp.fc2.weight', 'text_model.encoder.layers.3.self_attn.k_proj.weight', 'text_model.encoder.layers.11.layer_norm1.bias', 'text_model.encoder.layers.2.mlp.fc2.bias', 'text_model.encoder.layers.4.self_attn.v_proj.bias', 'text_model.encoder.layers.10.mlp.fc2.bias', 'text_model.embeddings.position_ids', 'text_model.encoder.layers.6.self_attn.q_proj.weight', 'text_model.encoder.layers.0.self_attn.out_proj.weight', 'text_model.encoder.layers.0.self_attn.q_proj.bias', 'text_model.encoder.layers.8.layer_norm2.weight', 'text_model.encoder.layers.6.mlp.fc2.weight', 'text_model.encoder.layers.11.layer_norm1.weight', 'text_model.encoder.layers.9.mlp.fc2.bias', 'text_model.encoder.layers.8.layer_norm1.weight', 'text_model.encoder.layers.4.mlp.fc2.weight', 'text_model.encoder.layers.11.layer_norm2.weight', 'text_model.encoder.layers.9.self_attn.k_proj.weight', 'text_model.encoder.layers.3.self_attn.v_proj.weight', 'text_model.encoder.layers.5.self_attn.out_proj.bias', 'text_model.encoder.layers.8.self_attn.q_proj.weight', 'text_model.encoder.layers.5.mlp.fc1.weight', 'text_model.encoder.layers.0.mlp.fc1.weight', 'text_model.encoder.layers.4.layer_norm1.weight', 'text_model.encoder.layers.10.mlp.fc2.weight', 'text_model.encoder.layers.8.self_attn.v_proj.weight', 'text_model.encoder.layers.4.self_attn.out_proj.weight', 'text_model.encoder.layers.3.layer_norm1.weight', 'text_model.encoder.layers.2.self_attn.out_proj.bias', 'text_model.encoder.layers.11.self_attn.k_proj.bias', 'text_model.encoder.layers.6.layer_norm1.bias', 'text_model.encoder.layers.6.self_attn.k_proj.weight', 'text_model.encoder.layers.5.layer_norm2.weight', 'text_model.encoder.layers.1.layer_norm2.bias', 'text_model.encoder.layers.1.layer_norm1.weight', 'text_model.encoder.layers.7.layer_norm1.bias', 'logit_scale', 'text_model.encoder.layers.11.mlp.fc2.weight', 'text_model.encoder.layers.0.layer_norm2.weight', 'text_model.encoder.layers.1.self_attn.k_proj.bias', 'text_model.encoder.layers.7.self_attn.k_proj.weight', 'text_model.encoder.layers.7.mlp.fc1.bias', 'text_model.encoder.layers.1.self_attn.q_proj.weight', 'text_model.encoder.layers.1.self_attn.out_proj.bias', 'text_model.encoder.layers.6.layer_norm2.weight', 'text_model.encoder.layers.7.layer_norm2.bias', 'text_model.encoder.layers.5.mlp.fc1.bias', 'text_model.encoder.layers.2.self_attn.v_proj.bias', 'text_model.encoder.layers.0.self_attn.v_proj.weight', 'text_model.encoder.layers.2.layer_norm2.weight', 'text_model.encoder.layers.9.layer_norm1.bias', 'text_model.encoder.layers.10.mlp.fc1.weight', 'text_model.final_layer_norm.weight', 'text_model.encoder.layers.5.layer_norm1.bias', 'text_model.encoder.layers.5.self_attn.k_proj.weight', 'text_model.encoder.layers.3.self_attn.k_proj.bias', 'text_model.encoder.layers.5.self_attn.out_proj.weight', 'text_model.encoder.layers.1.layer_norm1.bias', 'text_model.encoder.layers.11.self_attn.q_proj.bias', 'text_model.encoder.layers.10.self_attn.q_proj.bias', 'text_model.encoder.layers.2.self_attn.out_proj.weight', 'text_model.encoder.layers.7.self_attn.v_proj.bias', 'text_model.encoder.layers.5.self_attn.q_proj.weight', 'text_model.encoder.layers.9.self_attn.q_proj.bias', 'text_model.encoder.layers.6.self_attn.v_proj.bias', 'text_model.encoder.layers.9.self_attn.out_proj.weight', 'text_model.encoder.layers.7.layer_norm1.weight', 'text_model.encoder.layers.0.self_attn.k_proj.weight', 'text_model.encoder.layers.10.self_attn.k_proj.weight', 'text_model.encoder.layers.11.self_attn.out_proj.weight', 'text_model.encoder.layers.9.self_attn.v_proj.bias', 'text_model.encoder.layers.1.layer_norm2.weight', 'text_model.encoder.layers.0.mlp.fc1.bias', 'text_model.encoder.layers.8.self_attn.q_proj.bias', 'text_model.encoder.layers.10.self_attn.q_proj.weight', 'text_model.encoder.layers.0.mlp.fc2.weight', 'text_model.encoder.layers.2.self_attn.k_proj.weight', 'text_model.encoder.layers.9.self_attn.out_proj.bias', 'text_model.encoder.layers.6.layer_norm1.weight', 'text_model.encoder.layers.8.mlp.fc1.weight', 'text_model.encoder.layers.5.self_attn.v_proj.weight', 'text_model.encoder.layers.11.mlp.fc2.bias', 'text_model.encoder.layers.4.layer_norm2.weight', 'text_model.encoder.layers.11.mlp.fc1.weight', 'text_model.encoder.layers.0.layer_norm1.weight', 'text_model.encoder.layers.8.mlp.fc2.weight', 'text_model.encoder.layers.2.self_attn.v_proj.weight', 'text_model.encoder.layers.0.self_attn.v_proj.bias', 'text_model.encoder.layers.0.layer_norm2.bias', 'text_model.encoder.layers.8.self_attn.out_proj.bias', 'text_model.encoder.layers.7.self_attn.out_proj.bias', 'text_model.encoder.layers.1.self_attn.out_proj.weight', 'text_model.encoder.layers.6.self_attn.v_proj.weight', 'text_model.encoder.layers.11.self_attn.v_proj.bias', 'text_model.encoder.layers.11.self_attn.k_proj.weight', 'text_model.encoder.layers.3.mlp.fc2.weight', 'text_model.encoder.layers.5.self_attn.q_proj.bias', 'text_model.encoder.layers.1.mlp.fc1.weight', 'text_model.encoder.layers.11.self_attn.v_proj.weight', 'text_model.encoder.layers.3.self_attn.v_proj.bias', 'text_model.encoder.layers.4.mlp.fc2.bias', 'text_model.encoder.layers.5.self_attn.k_proj.bias', 'text_model.encoder.layers.9.mlp.fc1.weight', 'text_model.encoder.layers.3.mlp.fc1.weight', 'text_model.encoder.layers.10.self_attn.v_proj.weight', 'text_model.encoder.layers.4.self_attn.out_proj.bias', 'text_model.encoder.layers.11.mlp.fc1.bias', 'text_model.encoder.layers.1.mlp.fc2.bias', 'text_model.encoder.layers.10.self_attn.k_proj.bias', 'text_model.encoder.layers.3.layer_norm2.weight', 'text_model.encoder.layers.4.self_attn.v_proj.weight', 'text_model.encoder.layers.6.self_attn.out_proj.bias', 'text_model.encoder.layers.7.mlp.fc2.weight', 'text_model.encoder.layers.5.self_attn.v_proj.bias', 'text_projection.weight', 'text_model.encoder.layers.6.self_attn.k_proj.bias', 'text_model.encoder.layers.10.self_attn.out_proj.bias', 'text_model.encoder.layers.7.layer_norm2.weight', 'text_model.encoder.layers.7.self_attn.v_proj.weight', 'text_model.encoder.layers.8.self_attn.k_proj.weight', 'text_model.encoder.layers.9.layer_norm1.weight', 'text_model.encoder.layers.9.self_attn.v_proj.weight', 'text_model.encoder.layers.5.mlp.fc2.bias']
- This IS expected if you are initializing CLIPVisionModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CLIPVisionModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at openai/clip-vit-base-patch32 were not used when initializing CLIPTextModel: ['vision_model.encoder.layers.9.layer_norm1.weight', 'vision_model.encoder.layers.4.self_attn.k_proj.weight', 'vision_model.encoder.layers.9.self_attn.q_proj.bias', 'vision_model.encoder.layers.0.layer_norm1.weight', 'vision_model.encoder.layers.2.mlp.fc1.weight', 'visual_projection.weight', 'vision_model.encoder.layers.5.mlp.fc2.bias', 'vision_model.encoder.layers.11.self_attn.k_proj.bias', 'vision_model.encoder.layers.4.self_attn.v_proj.bias', 'vision_model.encoder.layers.10.layer_norm1.bias', 'vision_model.encoder.layers.1.layer_norm1.bias', 'vision_model.encoder.layers.8.self_attn.out_proj.bias', 'vision_model.encoder.layers.1.self_attn.k_proj.bias', 'vision_model.encoder.layers.7.self_attn.k_proj.weight', 'vision_model.encoder.layers.9.self_attn.v_proj.weight', 'vision_model.encoder.layers.10.mlp.fc2.weight', 'vision_model.encoder.layers.6.mlp.fc1.bias', 'vision_model.encoder.layers.0.self_attn.v_proj.weight', 'vision_model.encoder.layers.3.self_attn.out_proj.bias', 'vision_model.encoder.layers.4.mlp.fc2.bias', 'vision_model.encoder.layers.0.layer_norm1.bias', 'vision_model.embeddings.position_embedding.weight', 'vision_model.encoder.layers.10.layer_norm2.weight', 'vision_model.encoder.layers.1.self_attn.v_proj.weight', 'vision_model.encoder.layers.11.layer_norm1.weight', 'vision_model.encoder.layers.7.layer_norm2.weight', 'vision_model.encoder.layers.1.mlp.fc1.bias', 'vision_model.encoder.layers.11.mlp.fc2.bias', 'vision_model.encoder.layers.9.layer_norm2.weight', 'vision_model.encoder.layers.7.self_attn.k_proj.bias', 'vision_model.encoder.layers.1.self_attn.out_proj.bias', 'vision_model.encoder.layers.8.self_attn.v_proj.weight', 'vision_model.encoder.layers.6.self_attn.q_proj.weight', 'vision_model.encoder.layers.5.self_attn.out_proj.bias', 'vision_model.encoder.layers.1.layer_norm2.bias', 'vision_model.encoder.layers.9.mlp.fc2.bias', 'vision_model.embeddings.patch_embedding.weight', 'vision_model.encoder.layers.7.layer_norm1.bias', 'vision_model.encoder.layers.11.self_attn.q_proj.weight', 'vision_model.encoder.layers.8.layer_norm1.weight', 'vision_model.encoder.layers.6.self_attn.q_proj.bias', 'vision_model.encoder.layers.8.layer_norm2.bias', 'vision_model.encoder.layers.5.self_attn.q_proj.bias', 'vision_model.encoder.layers.11.self_attn.v_proj.weight', 'vision_model.encoder.layers.11.layer_norm2.weight', 'vision_model.encoder.layers.2.mlp.fc2.weight', 'vision_model.encoder.layers.0.self_attn.v_proj.bias', 'vision_model.encoder.layers.6.mlp.fc2.weight', 'vision_model.encoder.layers.8.self_attn.q_proj.bias', 'vision_model.encoder.layers.0.mlp.fc1.weight', 'vision_model.encoder.layers.8.self_attn.k_proj.bias', 'vision_model.encoder.layers.9.mlp.fc2.weight', 'vision_model.encoder.layers.4.self_attn.out_proj.weight', 'vision_model.encoder.layers.1.self_attn.q_proj.weight', 'vision_model.encoder.layers.11.self_attn.k_proj.weight', 'vision_model.encoder.layers.9.mlp.fc1.bias', 'vision_model.encoder.layers.2.layer_norm2.bias', 'vision_model.encoder.layers.10.self_attn.q_proj.bias', 'vision_model.encoder.layers.10.self_attn.k_proj.bias', 'vision_model.encoder.layers.0.self_attn.k_proj.bias', 'vision_model.encoder.layers.5.self_attn.v_proj.bias', 'vision_model.pre_layrnorm.weight', 'vision_model.encoder.layers.3.self_attn.k_proj.bias', 'vision_model.encoder.layers.1.layer_norm2.weight', 'vision_model.encoder.layers.10.layer_norm2.bias', 'vision_model.encoder.layers.4.self_attn.out_proj.bias', 'vision_model.encoder.layers.11.mlp.fc1.bias', 'vision_model.pre_layrnorm.bias', 'vision_model.encoder.layers.8.self_attn.v_proj.bias', 'vision_model.encoder.layers.5.layer_norm1.bias', 'vision_model.encoder.layers.6.layer_norm2.weight', 'vision_model.encoder.layers.4.mlp.fc1.weight', 'vision_model.encoder.layers.8.self_attn.out_proj.weight', 'vision_model.encoder.layers.7.self_attn.out_proj.weight', 'vision_model.encoder.layers.0.self_attn.out_proj.bias', 'vision_model.encoder.layers.6.mlp.fc2.bias', 'vision_model.encoder.layers.2.layer_norm2.weight', 'vision_model.encoder.layers.6.self_attn.v_proj.bias', 'vision_model.encoder.layers.0.self_attn.q_proj.bias', 'vision_model.encoder.layers.6.mlp.fc1.weight', 'vision_model.encoder.layers.11.self_attn.out_proj.weight', 'vision_model.encoder.layers.7.self_attn.v_proj.weight', 'vision_model.encoder.layers.8.mlp.fc1.weight', 'vision_model.encoder.layers.3.mlp.fc1.weight', 'vision_model.encoder.layers.1.mlp.fc2.bias', 'vision_model.encoder.layers.10.mlp.fc2.bias', 'vision_model.encoder.layers.6.layer_norm2.bias', 'vision_model.encoder.layers.4.layer_norm2.weight', 'vision_model.encoder.layers.3.layer_norm1.bias', 'vision_model.encoder.layers.0.mlp.fc2.bias', 'vision_model.encoder.layers.8.self_attn.q_proj.weight', 'vision_model.encoder.layers.3.self_attn.v_proj.weight', 'vision_model.encoder.layers.10.self_attn.q_proj.weight', 'vision_model.encoder.layers.2.self_attn.q_proj.weight', 'vision_model.encoder.layers.10.self_attn.out_proj.bias', 'vision_model.encoder.layers.11.self_attn.v_proj.bias', 'vision_model.encoder.layers.0.self_attn.k_proj.weight', 'vision_model.encoder.layers.4.layer_norm1.weight', 'vision_model.encoder.layers.0.self_attn.out_proj.weight', 'vision_model.embeddings.position_ids', 'vision_model.encoder.layers.4.mlp.fc1.bias', 'vision_model.encoder.layers.11.self_attn.out_proj.bias', 'vision_model.encoder.layers.3.mlp.fc2.weight', 'vision_model.encoder.layers.4.self_attn.q_proj.weight', 'vision_model.encoder.layers.2.mlp.fc2.bias', 'vision_model.encoder.layers.0.layer_norm2.weight', 'vision_model.encoder.layers.5.self_attn.k_proj.bias', 'vision_model.encoder.layers.7.self_attn.q_proj.bias', 'vision_model.encoder.layers.4.mlp.fc2.weight', 'vision_model.encoder.layers.1.self_attn.out_proj.weight', 'vision_model.encoder.layers.7.mlp.fc1.weight', 'vision_model.encoder.layers.10.self_attn.v_proj.weight', 'vision_model.encoder.layers.3.mlp.fc1.bias', 'vision_model.encoder.layers.2.layer_norm1.bias', 'vision_model.encoder.layers.9.layer_norm1.bias', 'vision_model.encoder.layers.10.self_attn.k_proj.weight', 'vision_model.encoder.layers.5.layer_norm2.bias', 'vision_model.encoder.layers.4.layer_norm1.bias', 'vision_model.encoder.layers.5.mlp.fc1.bias', 'logit_scale', 'vision_model.encoder.layers.4.self_attn.k_proj.bias', 'vision_model.encoder.layers.8.mlp.fc2.bias', 'vision_model.encoder.layers.10.self_attn.v_proj.bias', 'vision_model.encoder.layers.1.mlp.fc1.weight', 'vision_model.encoder.layers.5.self_attn.q_proj.weight', 'vision_model.encoder.layers.0.mlp.fc1.bias', 'vision_model.encoder.layers.2.self_attn.k_proj.bias', 'vision_model.encoder.layers.1.layer_norm1.weight', 'vision_model.encoder.layers.10.layer_norm1.weight', 'vision_model.encoder.layers.9.self_attn.out_proj.bias', 'vision_model.encoder.layers.8.layer_norm2.weight', 'vision_model.encoder.layers.7.layer_norm1.weight', 'vision_model.encoder.layers.6.self_attn.out_proj.weight', 'vision_model.encoder.layers.10.self_attn.out_proj.weight', 'vision_model.encoder.layers.7.layer_norm2.bias', 'vision_model.encoder.layers.10.mlp.fc1.weight', 'vision_model.encoder.layers.7.mlp.fc2.bias', 'vision_model.encoder.layers.11.layer_norm2.bias', 'vision_model.encoder.layers.1.self_attn.v_proj.bias', 'vision_model.encoder.layers.2.self_attn.k_proj.weight', 'vision_model.encoder.layers.11.mlp.fc2.weight', 'vision_model.encoder.layers.9.self_attn.out_proj.weight', 'vision_model.encoder.layers.2.self_attn.out_proj.bias', 'vision_model.encoder.layers.7.self_attn.q_proj.weight', 'vision_model.encoder.layers.2.layer_norm1.weight', 'vision_model.encoder.layers.5.mlp.fc2.weight', 'vision_model.encoder.layers.4.self_attn.v_proj.weight', 'vision_model.encoder.layers.5.self_attn.v_proj.weight', 'vision_model.encoder.layers.11.mlp.fc1.weight', 'vision_model.encoder.layers.11.self_attn.q_proj.bias', 'vision_model.encoder.layers.0.self_attn.q_proj.weight', 'vision_model.encoder.layers.7.mlp.fc1.bias', 'vision_model.encoder.layers.7.self_attn.out_proj.bias', 'vision_model.encoder.layers.2.self_attn.out_proj.weight', 'vision_model.encoder.layers.8.mlp.fc1.bias', 'vision_model.encoder.layers.9.layer_norm2.bias', 'vision_model.post_layernorm.weight', 'vision_model.encoder.layers.1.mlp.fc2.weight', 'vision_model.encoder.layers.8.mlp.fc2.weight', 'vision_model.encoder.layers.5.layer_norm2.weight', 'vision_model.encoder.layers.6.self_attn.out_proj.bias', 'vision_model.encoder.layers.3.self_attn.q_proj.bias', 'vision_model.post_layernorm.bias', 'vision_model.encoder.layers.6.layer_norm1.bias', 'vision_model.embeddings.class_embedding', 'vision_model.encoder.layers.9.self_attn.q_proj.weight', 'vision_model.encoder.layers.9.mlp.fc1.weight', 'vision_model.encoder.layers.5.self_attn.out_proj.weight', 'vision_model.encoder.layers.6.self_attn.k_proj.weight', 'vision_model.encoder.layers.2.self_attn.v_proj.weight', 'vision_model.encoder.layers.5.self_attn.k_proj.weight', 'vision_model.encoder.layers.9.self_attn.v_proj.bias', 'vision_model.encoder.layers.3.layer_norm1.weight', 'vision_model.encoder.layers.11.layer_norm1.bias', 'vision_model.encoder.layers.0.mlp.fc2.weight', 'vision_model.encoder.layers.6.self_attn.k_proj.bias', 'vision_model.encoder.layers.8.self_attn.k_proj.weight', 'vision_model.encoder.layers.6.self_attn.v_proj.weight', 'vision_model.encoder.layers.3.self_attn.v_proj.bias', 'vision_model.encoder.layers.3.self_attn.k_proj.weight', 'vision_model.encoder.layers.3.layer_norm2.bias', 'vision_model.encoder.layers.2.self_attn.v_proj.bias', 'vision_model.encoder.layers.6.layer_norm1.weight', 'vision_model.encoder.layers.4.layer_norm2.bias', 'vision_model.encoder.layers.3.self_attn.out_proj.weight', 'vision_model.encoder.layers.7.self_attn.v_proj.bias', 'vision_model.encoder.layers.10.mlp.fc1.bias', 'vision_model.encoder.layers.3.mlp.fc2.bias', 'vision_model.encoder.layers.3.layer_norm2.weight', 'vision_model.encoder.layers.0.layer_norm2.bias', 'vision_model.encoder.layers.8.layer_norm1.bias', 'vision_model.encoder.layers.9.self_attn.k_proj.bias', 'vision_model.encoder.layers.3.self_attn.q_proj.weight', 'vision_model.encoder.layers.5.layer_norm1.weight', 'vision_model.encoder.layers.2.self_attn.q_proj.bias', 'vision_model.encoder.layers.7.mlp.fc2.weight', 'text_projection.weight', 'vision_model.encoder.layers.9.self_attn.k_proj.weight', 'vision_model.encoder.layers.1.self_attn.k_proj.weight', 'vision_model.encoder.layers.1.self_attn.q_proj.bias', 'vision_model.encoder.layers.2.mlp.fc1.bias', 'vision_model.encoder.layers.4.self_attn.q_proj.bias', 'vision_model.encoder.layers.5.mlp.fc1.weight']
- This IS expected if you are initializing CLIPTextModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CLIPTextModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Out[3]:
CLIPTextModel(
  (text_model): CLIPTextTransformer(
    (embeddings): CLIPTextEmbeddings(
      (token_embedding): Embedding(49408, 512)
      (position_embedding): Embedding(77, 512)
    )
    (encoder): CLIPEncoder(
      (layers): ModuleList(
        (0): CLIPEncoderLayer(
          (self_attn): CLIPAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (layer_norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (mlp): CLIPMLP(
            (activation_fn): QuickGELUActivation()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
          )
          (layer_norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (1): CLIPEncoderLayer(
          (self_attn): CLIPAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (layer_norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (mlp): CLIPMLP(
            (activation_fn): QuickGELUActivation()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
          )
          (layer_norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (2): CLIPEncoderLayer(
          (self_attn): CLIPAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (layer_norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (mlp): CLIPMLP(
            (activation_fn): QuickGELUActivation()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
          )
          (layer_norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (3): CLIPEncoderLayer(
          (self_attn): CLIPAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (layer_norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (mlp): CLIPMLP(
            (activation_fn): QuickGELUActivation()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
          )
          (layer_norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (4): CLIPEncoderLayer(
          (self_attn): CLIPAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (layer_norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (mlp): CLIPMLP(
            (activation_fn): QuickGELUActivation()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
          )
          (layer_norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (5): CLIPEncoderLayer(
          (self_attn): CLIPAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (layer_norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (mlp): CLIPMLP(
            (activation_fn): QuickGELUActivation()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
          )
          (layer_norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (6): CLIPEncoderLayer(
          (self_attn): CLIPAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (layer_norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (mlp): CLIPMLP(
            (activation_fn): QuickGELUActivation()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
          )
          (layer_norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (7): CLIPEncoderLayer(
          (self_attn): CLIPAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (layer_norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (mlp): CLIPMLP(
            (activation_fn): QuickGELUActivation()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
          )
          (layer_norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (8): CLIPEncoderLayer(
          (self_attn): CLIPAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (layer_norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (mlp): CLIPMLP(
            (activation_fn): QuickGELUActivation()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
          )
          (layer_norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (9): CLIPEncoderLayer(
          (self_attn): CLIPAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (layer_norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (mlp): CLIPMLP(
            (activation_fn): QuickGELUActivation()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
          )
          (layer_norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (10): CLIPEncoderLayer(
          (self_attn): CLIPAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (layer_norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (mlp): CLIPMLP(
            (activation_fn): QuickGELUActivation()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
          )
          (layer_norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (11): CLIPEncoderLayer(
          (self_attn): CLIPAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (layer_norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (mlp): CLIPMLP(
            (activation_fn): QuickGELUActivation()
            (fc1): Linear(in_features=512, out_features=2048, bias=True)
            (fc2): Linear(in_features=2048, out_features=512, bias=True)
          )
          (layer_norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
      )
    )
    (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
)

Helper Functions and Classes¶

In the following section defines a lot of helpful classes and methods we use throughout the notebook to prepare, modify, and fit the data. Explanations for each sub-section are provided.

Utility Functions¶

Some utility functions to reduce code chunks and streamline some across-subject operations.

Visualizing and Correlating RDMs¶

The function below plots the neural or feature RDMs and allows different zooms. Additionally, we defined a function computing the correlation between neural and feature RDMs.

In [4]:
def plot_rdm(neural_rdms: dict = None, layer_features: dict = None, zoom: list = None):
    """Plots neural or feature RDMs for given subjects or models
    Args:
            neural_rdms: dictionary containing the neural RDMs for the left and right hemisphere
            layer_features: dictionary containing the layer features
            zoom: list containing the range of the x and y axis"""
    rdm_n, layers, size = (len(layer_features), list(layer_features.keys()), 25) if layer_features is not None \
                            else (len(neural_rdms), list(neural_rdms.keys()), 10)
             
    fig, ax = plt.subplots(1, rdm_n, figsize=(size, size))
    plt.subplots_adjust(wspace=0.2)
    for i, layer in enumerate(layers):
        if layer_features is not None:
            rdm = euclidean_distances(torch.tensor(layer_features[layer]).flatten(1).numpy())
            ax[i].set_xlabel(f"Correlation wit neural RDM: {corr_rdm(neural_rdms, rdm):.2f}")
        else:
            rdm = neural_rdms[layer]
            ax[i].set_xlabel("Trials", fontsize=8)
        ax[i].set_title(layer, fontsize=10)
        ax[i].imshow(rdm)

        if zoom is not None:
            ax[i].set_xlim(zoom)
            ax[i].set_ylim([zoom[-1], zoom[0]])
    plt.show()

def corr_rdm(neural_rdm: dict = None, layer_rdm: float = None):
    """Function to compute the correlation between neural and feature RDMs
    Args:
            neural_rdm: dictionary containing the neural RDMs for the left and right hemisphere
            layer_rdm: array containing the layer RDM"""
    layer_rdv = squareform(layer_rdm.round(5))
    neural_rdv = [squareform(n_rdm.round(5)) for n_rdm in neural_rdm.values()]
    
    return np.mean([spearmanr(n_rdv, layer_rdv)[0] for n_rdv in neural_rdv])
Shared Neural RDMs¶

The function below computes and prepares the neural RDMs from neural observations corresponding to the shared images (in the training data).

In [5]:
def neural_rdm_shared(subs: str = [f"subj0{i}" for i in range(1, 9)]):
    """"Function to compute the shared neural RDMs for the specified subjects
    Args:
            subs: list of strings indicating a subjects identification"""
    train_cap_file = pd.read_csv('data/algonauts_2023_caption_data.csv')

    # Save all shared trials per subject
    shared_img = []
    for sub in subs:
        dirs = Subject(sub)
        dirs.load_image_paths()

        img_match = [int(i[-9:-4]) for i in dirs.train_img_list]
        sub_df = train_cap_file[(train_cap_file['subject'] == sub) & (train_cap_file['nsdId'].isin(img_match))].reset_index(drop=True)
        shared_img.append(sub_df[(sub_df['n'] == 8)][['nsdId']])

    # Counting the shared images that occur in all subjects training split and saving the IDs
    counts_id = pd.concat(shared_img, axis=0, join="inner").groupby("nsdId").size()
    keep_id = counts_id[counts_id == 8].index.tolist()

    # Calculating the neural RDMs based on the retained IDs and corresponding indexes
    neural_rdms_shared = []
    for i, sub in enumerate(subs):
        dirs = Subject(sub)
        dirs.load_neural_data()

        shared_idx = shared_img[i][shared_img[i]["nsdId"].isin(keep_id)].index.values
        lh_fmri_shared, rh_fmri_shared = dirs.lh_fmri[shared_idx], dirs.rh_fmri[shared_idx]

        neural_rdms_shared.append(np.stack([euclidean_distances(lh_fmri_shared), euclidean_distances(rh_fmri_shared)]))
    
    return np.stack(neural_rdms_shared)
ROI Class Index¶

The following function returns the index of a given ROI class.

In [6]:
def roi_class_index(roi: str = "V1v"):
    """"Function to return the roi class index given the specified roi
    Args:
            subs: string indicating the roi"""
    
    if roi in ["V1v", "V1d", "V2v", "V2d", "V3v", "V3d", "hV4"]:
        roi_class = 0
    elif roi in ["EBA", "FBA-1", "FBA-2", "mTL-bodies"]:
        roi_class = 1
    elif roi in ["OFA", "FFA-1", "FFA-2", "mTL-faces", "aTL-faces"]:
        roi_class = 2
    elif roi in ["OPA", "PPA", "RSC"]:
        roi_class = 3
    elif roi in ["OWFA", "VWFA-1", "VWFA-2", "mfs-words", "mTL-words"]:
        roi_class = 4
    elif roi in ["early", "midventral", "midlateral", "midparietal", "ventral", "lateral", "parietal"]:
        roi_class = 5
    else:
        roi_class = 6

    return roi_class

ImageDataset & TextDataset¶

The ImageDataset and TextDataset classes are used to create datasets for PyTorch dataloaders. The ImageDataset is used for the NSD images and the TextDataset for the captions of the corresponding COCO images.

In [7]:
class ImageDataset(Dataset):
    """Class to prepare the image data for the PyTorch DataLoader"""
    def __init__(self, image_list, processor):
        self.image_list = image_list
        self.processor = processor

    def __len__(self):
        return len(self.image_list)

    def __getitem__(self, idx):
        image = Image.open(self.image_list[idx])
        image = self.processor(images=image, return_tensors="pt", padding=True)
        return image["pixel_values"].squeeze()

class TextDataset(Dataset):
    """"Class to prepare the text data for the PyTorch DataLoader"""
    def __init__(self, text, max_length, processor):
        self.text = processor(text=text, return_tensors="pt", padding="max_length", max_length=max_length)

    def __len__(self):
        return len(self.text["input_ids"])

    def __getitem__(self, idx):
        return self.text["input_ids"][idx]

Subject Class¶

The Subject class is initialized with a valid subject id (e.g., "subj01"). It stores all relevant paths and can load the data for the given subject.

In [8]:
class Subject:
    """Class to access all relevant data for a given subject"""
    def __init__(self, subject="subj01"):
        assert subject in ["subj01", "subj02", "subj03", "subj04", "subj05", "subj06", "subj07", "subj08",], "Invalid subject"
        self.subject = subject
        self.data_dir = "data/algonauts_2023_challenge_data"
        self.training_images_dir = f"{self.data_dir}/{subject}/training_split/training_images"
        self.test_images_dir = f"{self.data_dir}/{subject}/test_split/test_images"
        self.training_fmri_dir = f"{self.data_dir}/{subject}/training_split/training_fmri"
        self.roi_masks_dir = f"{self.data_dir}/{subject}/roi_masks"
        self.submission_dir = f"algonauts_2023_challenge_submission"
        # Load these as needed
        self.train_img_list = None
        self.test_img_list = None
        self.train_cap_list = None
        self.test_cap_list = None
        self.lh_fmri = None
        self.rh_fmri = None
        self.lh_roi_masks = None
        self.rh_roi_masks = None
        self.roi_name_maps = None
        self.lh_challenge_rois = None
        self.rh_challenge_rois = None
        self.train_img_dataloader = None
        self.test_img_dataloader = None
        self.train_cap_dataloader = None
        self.test_cap_dataloader = None            
        
    def load_image_paths(self) -> None:
        """Loads the image paths from the training and test directories"""
        self.train_img_list = glob.glob(f"{self.training_images_dir}/*.png")
        self.train_img_list.sort()
        self.test_img_list = glob.glob(f"{self.test_images_dir}/*.png")
        self.test_img_list.sort()
        # print(f"Training images: {len(self.train_img_list)}")
        # print(f"Test images: {len(self.test_img_list)}")

    def load_captions(self) -> None:
        """Loads and matches the captions from the csv file"""
        if self.train_img_list is None:
            self.load_image_paths()
        train_cap_file = pd.read_csv('data/algonauts_2023_caption_data.csv')
        img_match = [int(i[-9:-4]) for i in self.train_img_list]
        self.train_cap_list = train_cap_file[(train_cap_file['subject'] == self.subject) & (train_cap_file['nsdId'].isin(img_match))]['caption'].tolist()
        self.test_cap_list = train_cap_file[(train_cap_file['subject'] == self.subject) & (~train_cap_file['nsdId'].isin(img_match))]['caption'].tolist()
        # print(f"Training captions: {len(self.train_cap_list)}")
        # print(f"Test captions: {len(self.test_cap_list)}")
    
    def load_neural_data(self) -> None:
        """Loads the neural data from the .npy files"""
        self.lh_fmri = np.load(f"{self.training_fmri_dir}/lh_training_fmri.npy")
        self.rh_fmri = np.load(f"{self.training_fmri_dir}/rh_training_fmri.npy")
        # print(f"Left hemisphere neural data loaded. Shape: {self.lh_fmri.shape}")
        # print(f"Right hemisphere neural data loaded. Shape: {self.rh_fmri.shape}")

    def create_dataloaders(self, processor, batch_size) -> None:
        """Creates the dataloaders for the images and captions"""
        if self.train_img_list is None:
            self.load_image_paths()
        if self.train_cap_list is None:
            self.load_captions()
        max_caption_len = processor(text=self.train_cap_list + self.test_cap_list, return_tensors="pt", padding=True)["input_ids"].shape[1]   
        train_txt_dataset = TextDataset(self.train_cap_list, max_caption_len, processor)
        test_txt_dataset = TextDataset(self.test_cap_list, max_caption_len, processor)
        train_img_dataset = ImageDataset(self.train_img_list, processor)
        test_img_dataset = ImageDataset(self.test_img_list, processor)
        self.train_img_dataloader = DataLoader(train_img_dataset, batch_size=batch_size, shuffle=False)
        self.test_img_dataloader = DataLoader(test_img_dataset, batch_size=batch_size, shuffle=False)
        self.train_txt_dataloader = DataLoader(train_txt_dataset, batch_size=batch_size, shuffle=False)
        self.test_txt_dataloader = DataLoader(test_txt_dataset, batch_size=batch_size, shuffle=False)
        print(f"Train image dataloader: {len(self.train_img_dataloader)} batches")
        print(f"Test image dataloader: {len(self.test_img_dataloader)} batches")
        print(f"Train caption dataloader: {len(self.train_txt_dataloader)} batches")
        print(f"Test caption dataloader: {len(self.test_txt_dataloader)} batches")

    def load_challenge_rois(self) -> None:
        """Loads the challenge rois from the .npy files"""
        # Load the ROI classes mapping dictionaries
        roi_mapping_files = ['mapping_prf-visualrois.npy', 'mapping_floc-bodies.npy',
            'mapping_floc-faces.npy', 'mapping_floc-places.npy',
            'mapping_floc-words.npy', 'mapping_streams.npy']
        self.roi_name_maps = []
        for r in roi_mapping_files:
            self.roi_name_maps.append(np.load(f"{self.roi_masks_dir}/{r}", allow_pickle=True).item())

        # Load the ROI brain surface maps
        lh_challenge_roi_files = ['lh.prf-visualrois_challenge_space.npy',
            'lh.floc-bodies_challenge_space.npy', 'lh.floc-faces_challenge_space.npy',
            'lh.floc-places_challenge_space.npy', 'lh.floc-words_challenge_space.npy',
            'lh.streams_challenge_space.npy']
        rh_challenge_roi_files = ['rh.prf-visualrois_challenge_space.npy',
            'rh.floc-bodies_challenge_space.npy', 'rh.floc-faces_challenge_space.npy',
            'rh.floc-places_challenge_space.npy', 'rh.floc-words_challenge_space.npy',
            'rh.streams_challenge_space.npy']
        self.lh_challenge_rois = []
        self.rh_challenge_rois = []
        for r in range(len(lh_challenge_roi_files)):
            self.lh_challenge_rois.append(np.load(f"{self.roi_masks_dir}/{lh_challenge_roi_files[r]}"))
            self.rh_challenge_rois.append(np.load(f"{self.roi_masks_dir}/{rh_challenge_roi_files[r]}"))

    def load_roi_masks(self, roi="V1v", hemisphere="lh"):
        valid_roi = ["V1v", "V1d", "V2v", "V2d", "V3v", 
                     "V3d", "hV4", "EBA", "FBA-1", "FBA-2", 
                     "mTL-bodies", "OFA", "FFA-1", "FFA-2", 
                     "mTL-faces", "aTL-faces", "OPA", "PPA", 
                     "RSC", "OWFA", "VWFA-1", "VWFA-2", "mfs-words", 
                     "mTL-words", "early", "midventral", "midlateral", 
                     "midparietal", "ventral", "lateral", "parietal",
                     "all-vertices"]
        valid_hemisphere = ["lh", "rh"]
        assert roi in valid_roi, "Invalid ROI"
        assert hemisphere in valid_hemisphere, "Invalid hemisphere"

        # Define the ROI class based on the selected ROI
        roi_class = ['prf-visualrois', 'floc-bodies', 'floc-faces', 'floc-places', 'floc-words', 'streams' , 'all-vertices'][roi_class_index(roi)]

        roi_class_dir = f"{hemisphere}.{roi_class}_fsaverage_space.npy"
        roi_map_dir = f"mapping_{roi_class}.npy"
        fsaverage_roi_class = np.load(f"{self.roi_masks_dir}/{roi_class_dir}")
        roi_map = None
        if roi != "all-vertices":
            roi_map = np.load(f"{self.roi_masks_dir}/{roi_map_dir}", allow_pickle=True).item()
        return fsaverage_roi_class, roi_map

CLIPFeatureExtractor Class¶

The CLIPFeatureExtractor class is used to extract the hidden states from any CLIP model.

In [9]:
class CLIPFeatureExtractor():
    """Extracts the features from hidden states of a CLIP model."""
    def __init__(
            self, 
            idxs: list = [i for i in range(13)], # hidden layer indices to extract features from. Standard CLIP has an embedding layer and 12 transformer layers.
            last_hidden_layer: bool = False, # whether to extract features from the last hidden layer
            model: PreTrainedModel = None, # CLIP model
            dataloader: DataLoader = None, # dataloader for batching
            ) -> None:
        self.idxs = idxs
        self.last_hidden_layer = last_hidden_layer
        self.generate_feature_dict()
        if self.last_hidden_layer:
            self.idxs.append(13) # adds an additional idx to allow for loop zip()
        self.model = model
        self.dataloader = dataloader
        print(f"idxs: {self.idxs}")
        print(f"feature dict keys: {self.feature_dict.keys()}")
    
    def generate_feature_dict(self) -> None:
        """Generates a feature dict according to the idxs and last_hidden_layer attributes."""
        feature_dict = {}
        for idx in self.idxs:
            if idx == 0:
                feature_dict["Embedding Layer"] = None
            else:
                feature_dict[f"Transformer Layer {idx}"] = None
        if self.last_hidden_layer:
            feature_dict["Final Layer"] = None
        self.feature_dict = feature_dict
    
    def concat_features(self, features: dict) -> None:
        """Adds extracted features to the feature dict.
        Args:
            features: features extracted from the output of a CLIP model"""
        keys = list(self.feature_dict.keys())
        # check if feature_dict is empty
        if self.feature_dict[keys[0]] is None:
            self.feature_dict = features
        else:
            for key in keys:
                self.feature_dict[key] = np.concatenate((self.feature_dict[key], features[key]), axis=0)

    def extract_raw_features(self, output) -> None: 
        """Extracts features from the hidden states of a CLIP model and concates them to the feature_dict.
        Args:
            output: output of a CLIP model
        """
        features = {}
        for idx, key in zip(self.idxs, self.feature_dict.keys()):
            if key == "Final Layer":
                features[key] = output.last_hidden_state.cpu().detach().numpy()
            else:
                features[key] = output.hidden_states[idx].cpu().detach().numpy()
        self.concat_features(features)
    
    def extract_raw_features_from_model(self) -> None:
        """Runs the CLIP model on the dataloader and extracts features from the hidden states."""
        self.model = self.model.to(device)
        with torch.no_grad():
            for batch in tqdm(self.dataloader):
                batch = batch.to(device)
                output = self.model(batch, output_hidden_states=True)
                self.extract_raw_features(output)
                batch = None # clear batch from memory
                output = None # clear output from memory
        self.model = self.model.to("cpu")        

KFoldProcedure & KFold Classes¶

The KFoldProcedure class is used to define a procedure for each fold of the k-fold validation. It can be supplied to a KFold class which executes its run() function on all folds.

In [10]:
class KFoldProcedure:
    """This class is used to define a procedure that is run on each fold of a k-fold cross validation."""
    def __init__(self) -> None:
        assert isinstance(self.model_name, str) and len(self.model_name) > 0, "Please define a model name as part of the KFold Procedure."
        assert isinstance(self.description, str ) and len(self.description) > 0 , "Please define a description as part of the KFold Procedure."

    def prepare(self) -> None:
        """Operations that should be executed before the fold loop"""
        raise NotImplementedError

    def run(self, train_idxs: np.ndarray, val_idxs: np.ndarray) -> Dict[str, Dict[str, np.ndarray]]:
        """This should return a dict of correlations.
        dict format: {"layer": {"lh": np.ndarray, "lh": np.ndarray}}"""
        raise NotImplementedError
    
    def return_idxs(self):
        """Returns idxs to create folds in the KFold class."""
        raise NotImplementedError
    
    def return_roi_names(self) -> List[str]:
        """Required for the plot function in the KFold class."""
        return self.roi_names    
    
    def return_model_name_and_description(self) -> Tuple[str, str]:
        return self.model_name, self.description

    def calculate_correlations(self, lh_pred, rh_pred, lh_fmri, rh_fmri) -> Tuple[np.ndarray, np.ndarray]:
        """Calculate correlation between prediction and fmri activation"""
        lh_correlation = np.zeros(lh_pred.shape[1])
        for v in tqdm(range(lh_pred.shape[1])):
            lh_correlation[v] = corr(lh_pred[:,v], lh_fmri[:,v])[0]
        # Right hemisphere
        rh_correlation = np.zeros(rh_pred.shape[1])
        for v in tqdm(range(rh_pred.shape[1])):
            rh_correlation[v] = corr(rh_pred[:,v], rh_fmri[:,v])[0]
        return lh_correlation, rh_correlation

    def calculate_median_correlations(self, lh_correlation, rh_correlation) -> Tuple[np.ndarray, np.ndarray]:
        """Calculate median correlation for each ROI."""
        # Select the correlation results vertices of each ROI
        lh_challange_rois = self.lh_challenge_rois
        rh_challange_rois = self.rh_challenge_rois
        self.roi_names = []
        lh_roi_correlation = []
        rh_roi_correlation = []
        for r1 in range(len(lh_challange_rois)):
            for r2 in self.roi_name_maps[r1].items():
                if r2[0] != 0: # zeros indicate to vertices falling outside the ROI of interest
                    self.roi_names.append(r2[1])
                    lh_roi_idx = np.where(lh_challange_rois[r1] == r2[0])[0]
                    rh_roi_idx = np.where(rh_challange_rois[r1] == r2[0])[0]
                    lh_roi_correlation.append(lh_correlation[lh_roi_idx])
                    rh_roi_correlation.append(rh_correlation[rh_roi_idx])
        self.roi_names.append('All vertices')
        lh_roi_correlation.append(lh_correlation)
        rh_roi_correlation.append(rh_correlation)
        lh_median_roi_correlation = [np.median(lh_roi_correlation[r])
            for r in range(len(lh_roi_correlation))]
        rh_median_roi_correlation = [np.median(rh_roi_correlation[r])
            for r in range(len(rh_roi_correlation))]
        return lh_median_roi_correlation, rh_median_roi_correlation
    
class KFold:
    """Run a k-fold cross validation with a given procedure."""
    def __init__(self, folds: int = 8, seed: int = 5, procedure: KFoldProcedure = None) -> None:
        assert folds > 1, "folds must be greater than 1"
        assert seed > 0, "seed must be greater than 0"
        assert isinstance(folds, int), "folds must be an integer"
        assert isinstance(seed, int), "seed must be an integer"
        #assert isinstance(procedure, KFoldProcedure), "procedure must be an instance of KFoldProcedure"
        self.folds = folds
        self.seed = seed
        self.procedure = procedure
        self.fold_correlations = {}
        self.mean_correlations = None

    def run(self) -> None:
        """Runs the procedure on each fold and accesses the correlations."""
        self.procedure.prepare()
        # Create k folds   
        fold_idxs = self.procedure.return_idxs()
        np.random.seed(self.seed)
        np.random.shuffle(fold_idxs)
        self.fold_idxs = np.array_split(fold_idxs, self.folds)

        for fold in range(self.folds):
            # Select validation and train set
            val_idxs = self.fold_idxs[fold]
            train_idxs = np.concatenate([self.fold_idxs[j] for j in range(self.folds) if j != fold])
            
            # Info for current fold
            print(f"#############################################")
            print(f"# Fold: {fold+1}/ {self.folds}")         
            print(f"# Train size: {len(train_idxs)}")
            print(f"# Validation size: {len(val_idxs)}")
            print(f"#############################################")

            # Run procedure
            self.fold_correlations[fold] = self.procedure.run(train_idxs, val_idxs)
        # Get model name and description
        self.model_name, self.description = self.procedure.return_model_name_and_description()
        self.roi_names = self.procedure.return_roi_names()
        self.calculate_mean_accross_folds()
        self.mean_correlations_to_csv()
    
    def calculate_mean_accross_folds(self):
        """Calculates the mean across folds for each layer"""
        self.mean_correlations = {}
        for layer in self.fold_correlations[0].keys():
            self.mean_correlations[layer] = {}
            for hemi in self.fold_correlations[0][layer].keys():
                self.mean_correlations[layer][hemi] = np.nanmean([self.fold_correlations[fold][layer][hemi] for fold in range(self.folds)], axis=0)
    
    def mean_correlations_to_csv(self) -> None:
        df = pd.DataFrame(columns=["model", "layer", "hemisphere", "roi", "correlation"])
        for layer in self.mean_correlations.keys():
                for hemisphere in self.mean_correlations[layer].keys():
                    for i in range(len(self.roi_names)):
                        df = df.append({"model": self.model_name, "layer": layer, "hemisphere": hemisphere, "roi": self.roi_names[i], "correlation": self.mean_correlations[layer][hemisphere][i]}, ignore_index=True)

        validations = glob.glob(f"validations/validation*")
        if len(validations) == 0:
            # create first validation folder
            folder_name = "validation001"
            os.mkdir(f"validations/{folder_name}")
        else:
            # create next validation folder
            last_validation = sorted(validations)[-1]
            last_validation_number = int(last_validation.split("/")[-1].split("validation")[-1])
            next_validation_number = last_validation_number + 1
            folder_name = f"validation{str(next_validation_number).zfill(3)}"
            os.mkdir(f"validations/{folder_name}")

        # Write text file with model description
        with open(f"validations/{folder_name}/info.txt", "w") as f:
            f.write(self.description)

        # Save dataframe
        df.to_csv(f"validations/{folder_name}/results.csv", index=False)

Additionally we define a function to plot the validation results.

In [11]:
def plot_kfold_result(validation = "001"):
    """Plots the validation results from the csv file in the given validaiton folder."""
    folder = f"validations/validation{validation}"
    with open(f"{folder}/info.txt", 'r') as f:
        info = f.read()
    df = pd.read_csv(f"{folder}/results.csv")
    # drop model column
    df = df.drop("model", axis=1)

    # Define color palette and assign colors to layers
    palette = sns.color_palette("colorblind", 14)
    layer_colors = {layer: palette[i] for i, layer in enumerate(df.layer.unique())}

    # Split data into left and right hemispheres
    left_df = df[df['hemisphere'] == 'lh']
    right_df = df[df['hemisphere'] == 'rh']

    # Create bar plots
    fig, axes = plt.subplots(2, 1, figsize=(30, 10))
    fig.suptitle(info)
    plt.subplots_adjust(hspace=0.5)

    bar_width = 0.05
    # Plot left hemisphere data
    for i, layer in enumerate(df.layer.unique()):
        layer_data = left_df[left_df['layer'] == layer]
        x = np.arange(len(layer_data['roi']))
        # center bars around xtick
        axes[0].bar(x - len(df.layer.unique())/2 * 0.05 + i * 0.05, layer_data['correlation'], width=bar_width, label=layer, color=layer_colors[layer])

    axes[0].margins(x=0.01) # reduce white space before first x tick
    axes[0].set_xticks([i for i in range(len(layer_data['roi']))])
    axes[0].set_xticklabels([roi for roi in layer_data['roi']])
    axes[0].set_title('Left Hemisphere')
    axes[0].set_xlabel('ROI')
    axes[0].tick_params(axis='x', labelrotation=45)
    axes[0].set_ylabel('Correlation')
    axes[0].legend(loc='upper center', bbox_to_anchor=(0.5, 1), ncol=5)
    axes[0].set_ylim([0, 1])
    row_colors = [layer_colors[layer] for layer in left_df.groupby('layer').mean().sort_values(by='correlation', ascending=False).index]
    axes[0].table(cellText=left_df.groupby('layer').mean().sort_values(by='correlation', ascending=False).round(3).values, rowLabels=df.groupby('layer').mean().sort_values(by='correlation', ascending=False).index, colLabels=["Mean Correlation"], rowColours=row_colors, bbox = [0.95, 0.5, 0.05, 0.5])


    # Plot right hemisphere data
    for i, layer in enumerate(df.layer.unique()):
        layer_data = right_df[right_df['layer'] == layer]
        x = np.arange(len(layer_data['roi']))
        axes[1].bar(x - len(df.layer.unique())/2 * 0.05 + i * 0.05, layer_data['correlation'], width=bar_width, label=layer, color=layer_colors[layer])
    
    axes[1].margins(x=0.01)
    axes[1].set_xticks([i for i in range(len(layer_data['roi']))])
    axes[1].set_xticklabels([roi for roi in layer_data['roi']])
    axes[1].set_title('Right Hemisphere')
    axes[1].set_xlabel('ROI')
    axes[1].tick_params(axis='x', labelrotation=45)
    axes[1].set_ylabel('Correlation')
    axes[1].legend(loc='upper center', bbox_to_anchor=(0.5, 1), ncol=5)
    axes[1].set_ylim([0, 1])
    row_colors = [layer_colors[layer] for layer in right_df.groupby('layer').mean().sort_values(by='correlation', ascending=False).index]
    axes[1].table(cellText=right_df.groupby('layer').mean().sort_values(by='correlation', ascending=False).round(3).values, rowLabels=df.groupby('layer').mean().sort_values(by='correlation', ascending=False).index, colLabels=["Mean Correlation"], rowColours=row_colors, bbox = [0.95, 0.5, 0.05, 0.5])

    plt.show()

Data Exploration¶

Subject¶

Which subject would you like to investigate?

In [13]:
# Select the subject
subject = "subj01" # ["subj01", "subj02", "subj03", "subj04", "subj05", "subj06", "subj07", "subj08"]
dirs = Subject(subject)

Load the fMRI training data¶

First we load the fMRI training split data of the selected subject. The fMRI data consists of two '.npy' files:

  • lh_training_fmri.npy: the left hemisphere (LH) fMRI data.
  • rh_training_fmri.npy: the right hemisphere (RH) fMRI data.

Both files are 2-dimensional arrays with training stimulus images as rows and fMRI vertices as columns.

In [14]:
# Load Neural data
dirs.load_neural_data()
lh_fmri, rh_fmri = dirs.lh_fmri, dirs.rh_fmri

print(f"Left hemisphere neural data loaded. Shape: {lh_fmri.shape}")
print(f"Right hemisphere neural data loaded. Shape: {rh_fmri.shape}")
Left hemisphere neural data loaded. Shape: (9841, 19004)
Right hemisphere neural data loaded. Shape: (9841, 20544)

Let's look at the neural RDMs for both hemispheres of subject 1.

In [15]:
# Store the neural RDMs
neural_rdms = {'Left Hemisphere RDM': euclidean_distances(lh_fmri), 
               'Right Hemisphere RDM': euclidean_distances(rh_fmri)}

plot_rdm(neural_rdms)

Due to the size of the RDMs, clusters are difficult to spot. We therefore zoom into a section to identify some clusters. (Feel free to play around with the zoom.)

In [16]:
plot_rdm(neural_rdms, zoom=[1200, 1500])

RDM clustering¶

The above RDM plots show (dis-)similarity clustering in the neural activity in response to the viewing of different natural scene stimuli for both the left and right hemispheres. This means that some trial stimuli elecit more (dis-)similar brain activity than other trials. Intuitively this makes sense, as the COCO dataset contains a wide variety of stimuli that vary in similarity, not unlike the clustering in the study of Jozwik et al. (2016). In other words, this clustering of RDMs of COCO images occurs because the neural representations of images with similar content or features may share similar patterns of activation in the brain or neural network. For example, if we compute the RDMs of the neural responses to images with similar backgrounds or scenes, such as images of beaches or mountains, the RDMs for these images may cluster together because the neural representations of the scene images share some common features such as colors, textures, and shapes that are unique to that scene. However, in this study we did not look into what COCO images showed high (dis-)similarity, but presumably the clustering is comparable to the clusting in Jozwik et al. (2016).

Noise Ceiling¶

Next, we evaluate what model performance we could expect, not only compared to other groups, but given the data and its associated signal and noise levels. To that extent, we decided to investigate the noise ceiling of our data, which gives us an indication of how much variance our model should preferably explain and the range of noise within the data. As subjects viewed different pictures for the majority of the experiment, averaging across their full neural RDMs is more difficult, especially as the number of trials differ. We therefore extracted the images that were shared across subjects and calculated the noise ceiling based on the corresponding subset of neural observations.

In [17]:
subs = [f"subj0{i}" for i in range(1, 9)]
neural_rdms_shared = neural_rdm_shared(subs = subs)

# Calculate Noise Ceiling
NCs = np.zeros((len(subs), 2))
rdms_subjects = np.mean(neural_rdms_shared, axis=1)
rdms_average_upp = np.mean(rdms_subjects, axis=0)
rdms_average_low = [np.mean(rdms_subjects[np.arange(len(subs)) != i,:,:], axis=0) for i in range(len(subs))]

# Computing the Individual Lower Noise Ceiling Bound
NCs[:, 0] = [spearmanr(squareform(rdms_subjects[i,:,:]), squareform(rdms_average_low[i]))[0] for i in range(len(subs))]
# Computing the Individual Upper Noise Ceiling Bound
NCs[:, 1] = [spearmanr(squareform(rdms_subjects[i,:,:]), squareform(rdms_average_upp))[0] for i in range(len(subs))]
# Compute Average Upper & Lower Noise Ceiling Bound 
mean_NC = np.mean(NCs, axis=0)

print(f'The Noise Ceiling for the shared images has a Lower Bound of {mean_NC[0]:.2f} and an Upper Bound of {mean_NC[1]:.2f}')
(noiseCeiling := pd.DataFrame({"LowerBound": [mean_NC[0]], "UpperBound": [mean_NC[1]]}))
The Noise Ceiling for the shared images has a Lower Bound of 0.53 and an Upper Bound of 0.65
Out[17]:
LowerBound UpperBound
0 0.5344 0.654185
In [18]:
# Plot Noise Ceiling
plt.figure(figsize=(10,5))
plt.plot([0,1], [noiseCeiling['UpperBound'], noiseCeiling['UpperBound']], 'r--', label='Upper bound')
plt.plot([0,1], [noiseCeiling['LowerBound'], noiseCeiling['LowerBound']], 'b--', label='Lower bound')
plt.fill_between([0,1], noiseCeiling['UpperBound'], noiseCeiling['LowerBound'], color='grey', alpha=0.5)
plt.fill_between([0,1], noiseCeiling['LowerBound'], 0, color='khaki', alpha=0.2)
plt.ylabel('Spearman correlation')
plt.ylim([0,1])
plt.xticks([])
plt.title('Noise ceiling')
plt.legend()
plt.margins(x=0.01) 
plt.show()

The Noise Ceiling indicates that there is a substantial amount of variance that can be explained and that the amount not reliably shared across subjects (i.e., noise) is rather thin/small. This finding gives us a good indication of what to expect when evaluating the later model fits.

Load the Stimulus Images¶

All images consist of natural scenes coming from the COCO dataset. The images are divided into a training and a test split (corresponding to the fMRI training and test data splits). The amount of training and test images varies between subjects.

In [19]:
# Load image paths
dirs.load_image_paths()
train_img_list, test_img_list = dirs.train_img_list, dirs.test_img_list

print(f"Training images: {len(train_img_list)}")
print(f"Test images: {len(test_img_list)}")
Training images: 9841
Test images: 159

Load the Stimulus Captions¶

All NSD images also have a caption associated with them that can be traced back to their origin in the COCO dataset (visible above each image in the dataset). First, a matching between new NSD and old COCO IDs was retrieved, available in the NSD AWS repository under nsd_stim_info_merged.csv. Next, JSON files containing the captions and corresponding IDs were downlaoded from the official COCO Website. Lastly, captions were matched based on old and new IDs.

In [20]:
# Load captions
dirs.load_captions()
train_cap_list, test_cap_list = dirs.train_cap_list, dirs.test_cap_list
max_caption_len = processor(text=train_cap_list + test_cap_list, return_tensors="pt", padding=True)["input_ids"].shape[1]

print(f"Training captions: {len(train_cap_list)}")
print(f"Test captions: {len(test_cap_list)}")
print(f"Max caption length: {max_caption_len}")
Training captions: 9841
Test captions: 159
Max caption length: 46

Visualize the fMRI ROIs and Training Images¶

The visual cortex is divided into multiple areas having different functional properties, referred to as regions-of-interest (ROIs). Along with the fMRI data, ROI indices are provided for selecting vertices belonging to specific visual ROIs.

Following is the list of ROIs (ROI class file names in parenthesis):

  • Early retinotopic visual regions: V1v, V1d, V2v, V2d, V3v, V3d, hV4.
  • Body-selective regions: EBA, FBA-1, FBA-2, mTL-bodies.
  • Face-selective regions: OFA, FFA-1, FFA-2, mTL-faces, aTL-faces.
  • Place-selective regions: OPA, PPA, RSC.
  • Word-selective regions: OWFA, VWFA-1, VWFA-2, mfs-words, mTL-words.
  • Anatomical streams: early, midventral, midlateral, midparietal, ventral, lateral, parietal.

Next we plot the vertices belonging to a specified fMRI ROI. An additional example image (stimulus) with caption is displayed.

In [21]:
img = 354 #@param
hemisphere = "left" # ['left', 'right']
roi = "EBA" # ["all-vertices", "V1v", "V1d", "V2v", "V2d", "V3v", "V3d", "hV4", "EBA", "FBA-1", "FBA-2", "mTL-bodies", "OFA", "FFA-1", "FFA-2", "mTL-faces", "aTL-faces", "OPA", "PPA", "RSC", "OWFA", "VWFA-1", "VWFA-2", "mfs-words", "mTL-words", "early", "midventral", "midlateral", "midparietal", "ventral", "lateral", "parietal"]

# Load the image
img_dir = os.path.join(train_img_list[img])
train_img = Image.open(img_dir).convert('RGB')

# Plot the image
plt.figure()
plt.axis('off')
plt.imshow(train_img)
plt.title('Training image: ' + str(img+1) + '\n' + train_cap_list[img]);

fsaverage_roi_class, roi_map = dirs.load_roi_masks(roi, "lh" if hemisphere == "left" else "rh")
if roi != "all-vertices":
    dirs.load_challenge_rois()
    challenge_roi_class = dirs.lh_challenge_rois if hemisphere == "left" else dirs.rh_challenge_rois

    roi_mapping = list(roi_map.keys())[list(roi_map.values()).index(roi)]
    fsaverage_roi = np.asarray(fsaverage_roi_class == roi_mapping, dtype=int)
    challenge_roi = np.asarray(challenge_roi_class[roi_class_index(roi)] == roi_mapping, dtype=int)

    # Map the fMRI data onto the brain surface map
    fsaverage_response = np.zeros(len(fsaverage_roi))
    if hemisphere == 'left':
        fsaverage_response[np.where(fsaverage_roi)[0]] = lh_fmri[img,np.where(challenge_roi)[0]]
    elif hemisphere == 'right':
        fsaverage_response[np.where(fsaverage_roi)[0]] = rh_fmri[img,np.where(challenge_roi)[0]]

else:
    fsaverage_response = np.zeros(len(fsaverage_roi_class))
    if hemisphere == 'left':
        fsaverage_response[np.where(fsaverage_roi_class)[0]] = lh_fmri[img]
    elif hemisphere == 'right':
        fsaverage_response[np.where(fsaverage_roi_class)[0]] = rh_fmri[img]

# # Create the interactive brain surface map
fsaverage = datasets.fetch_surf_fsaverage('fsaverage')
view = plotting.view_surf(
    surf_mesh=fsaverage['infl_'+hemisphere],
    surf_map=fsaverage_response,
    bg_map=fsaverage['sulc_'+hemisphere],
    threshold=1e-14,
    cmap='cold_hot',
    colorbar=True,
    title=roi+', '+hemisphere+' hemisphere'
    )
view
Out[21]:

Methods: Training and Evaluating Linearizing Encoding Models¶

This section serves as an illustraton of our procedure. For the final submission/k-fold model fits, we will use a predefined pipeline that does all of the data preparation and model fitting procedures automatically, which you can look up in more detail under Helper Functions and Classes.

Model Approach and Data Preparation¶

To build and evaluate linearizing encoding models, we use the CLIP architecture pre-trained by OpenAI. We train the model on a training partition and cross-validate it on the validation partitions (an independent partition of the training data). The linearizing encoding algorithm involves the following general steps:

  1. Split the data into training, validation and test partitions.

  2. Extract image and/or text features from CLIP (Vision and/or Text model) providing either images and/or captions.

  3. Linearly map the CLIP image/text data to fMRI responses. Usually, these data have a substantial amount of features; We therefore use PCA to reduce dimensionality (when appropriate).

  4. Evaluate and visualize the encoding model's prediction accuracy (i.e., encoding accuracy) using the validation partition.

In [22]:
rand_seed = 5
np.random.seed(rand_seed)

# Calculate how many stimulus images correspond to 90% of the training data
num_train = int(np.round(len(train_img_list)*0.9))
# Shuffle all training stimulus images
idxs = np.arange(len(train_img_list))
np.random.shuffle(idxs)
# Assign 90% of the shuffled stimulus images to the training partition,
# and 10% to the test partition
idxs_train, idxs_val = idxs[:num_train], idxs[num_train:]
# No need to shuffle or split the test stimulus images
idxs_test = np.arange(len(test_img_list))

print('Training stimulus images: ' + format(len(idxs_train)))
print('Validation stimulus images: ' + format(len(idxs_val)))
print('Test stimulus images: ' + format(len(idxs_test)))
Training stimulus images: 8857
Validation stimulus images: 984
Test stimulus images: 159

Step 1: Splitting the training fMRI data into training and validation partitions using the previously defined indices.

In [23]:
# Split neural data into train and validation
lh_fmri_train = lh_fmri[idxs_train]
lh_fmri_val = lh_fmri[idxs_val]
rh_fmri_train = rh_fmri[idxs_train]
rh_fmri_val = rh_fmri[idxs_val]

In order to compare the neural RDMs to later feature/layer RDMs from our model activations, we will store both hemispheric neural rdms.

In [24]:
neural_rdms_train = {'Left Hemisphere RDM': euclidean_distances(lh_fmri_train), 
                     'Right Hemisphere RDM': euclidean_distances(rh_fmri_train)}

Good memory management:

In [ ]:
del lh_fmri, rh_fmri

As we are working with a PyTorch instantiation of CLIP, we are using the corresponding PyTorch Dataset and DataLoader classes to create our three data partitions.

In [25]:
batch_size = 400

# Prepare data for dataloader 
train_img_dataset = ImageDataset(list(np.array(train_img_list)[idxs_train]), processor)
val_img_dataset = ImageDataset(list(np.array(train_img_list)[idxs_val]), processor)
test_img_dataset = ImageDataset(test_img_list, processor)

train_img_dataloader = DataLoader(train_img_dataset, batch_size=batch_size, shuffle=False)
val_img_dataloader = DataLoader(val_img_dataset, batch_size=batch_size, shuffle=False)
test_img_dataloader = DataLoader(test_img_dataset, batch_size=batch_size, shuffle=False)

print(f"Train caption dataloader: {len(train_img_dataloader)} batches")
print(f"Validation caption dataloader: {len(val_img_dataloader)} batches")
print(f"Test caption dataloader: {len(test_img_dataloader)} batches")
Train caption dataloader: 23 batches
Validation caption dataloader: 3 batches
Test caption dataloader: 1 batches

Running the model¶

For illustrative purposes, we will execute an example run of our approach, involving CLIP's Vision model and a subselection of feature layers we found interesting when running more elaborate k-fold validations. By splitting up the entire procedure, we will guide you through our reasoning regarding model fitting decisions and showcase the utilized model evaluation steps.

As a first step, we will extract raw image features for the transformer layers 4, 7, 11, 12, and 13 (final layer). The CLIPFeatureExtractor class takes the required arguments and returns the desired feature spaces.

In [26]:
raw_features = []
for loader in [train_img_dataloader, val_img_dataloader, test_img_dataloader]:
    feature_extractor = CLIPFeatureExtractor(idxs=[4, 7, 11, 12], last_hidden_layer=True, model=vis_model, dataloader=loader)
    feature_extractor.extract_raw_features_from_model()
    raw_features.append(feature_extractor.feature_dict)

del feature_extractor
idxs: [4, 7, 11, 12, 13]
feature dict keys: dict_keys(['Transformer Layer 4', 'Transformer Layer 7', 'Transformer Layer 11', 'Transformer Layer 12', 'Final Layer'])
100%|██████████| 23/23 [03:05<00:00,  8.06s/it]
idxs: [4, 7, 11, 12, 13]
feature dict keys: dict_keys(['Transformer Layer 4', 'Transformer Layer 7', 'Transformer Layer 11', 'Transformer Layer 12', 'Final Layer'])
100%|██████████| 3/3 [00:17<00:00,  5.72s/it]
idxs: [4, 7, 11, 12, 13]
feature dict keys: dict_keys(['Transformer Layer 4', 'Transformer Layer 7', 'Transformer Layer 11', 'Transformer Layer 12', 'Final Layer'])
100%|██████████| 1/1 [00:02<00:00,  2.70s/it]

Before we go into any fitting procedure, let's check out the RDMs of the visual features we extracted per layer and their correlations with the neural rdms stored above for the training neural observations. This gives us an indication of what layers might be more interesting to investigate in the model fitting.

In [27]:
plot_rdm(neural_rdms_train, raw_features[0])

The above feature RDMs show an interesting trend: The higher the layers, the closer are the computed pairwise distances. Notably, transformer layer 7 seems to show the highest correlations with the neural RDMS, making it a good candidate for the following fitting process. Although this layer shows the highest correlation, it is still far from what could potentially be explained in the data given the above Noise Ceiling.

Keep in mind when looking at the above correlations and later results:

  • Our overall goal is to predict the fMRI activity in the hemispheres and correlate the prediction with the actual fMRI pattern (on the validation split)
  • The above, however, is a direct correlation between layer and neural activations!

However, one problem remains before we move on to the modeling: The dimensionality of our current feature space is too high, potentially resulting in overfitting the training data, model assumption violations, and computational bottlenecks (as already noticable when executing the plotting on the single feature RDMs). Therefore, we define a PCA to reduce the dimensionality of our preferred layer in order to fit a reasonable amount of predictors for our model (linear regression). Additionally, we visually check how the variance is distributed across the PCA components.

In [28]:
# PCA Components
pca_components = 1000
pca = PCA(n_components=pca_components)

example_pca = pca.fit_transform(torch.tensor(raw_features[0]['Transformer Layer 7']).flatten(1).numpy())

# Evaluate how the variance is distributed across the PCA components
plt.plot(np.arange(pca_components), pca.explained_variance_ratio_)
plt.xlabel('Component')
plt.ylabel('Variance explained')
plt.title('Total variance explained: ' + str(np.sum(pca.explained_variance_ratio_)));

As we can see, we are by no means capturing the entire variation within our image features (transformer layer 7). However, 1000 components seem not worth the small additional variance they capture. Therefore, we will use 200 principal components for our subsequent linear models as, judging by the plot, they capture a major part of the variation already. Let's fit a PCA and transform the training image features accordingly.

In [29]:
pca_components = 200
pca = PCA(n_components=pca_components)

train_img_pca_features = pca.fit_transform(torch.tensor(raw_features[0]['Transformer Layer 7']).flatten(1).numpy())

# Fit linear regressions on the training data
img_reg_lh = LinearRegression().fit(train_img_pca_features, lh_fmri_train)
img_reg_rh = LinearRegression().fit(train_img_pca_features, rh_fmri_train)

Next, we transform the validation image features according to the fitted PCA and use these transformed features to predict the left and right hemisphere fMRI activity.

In [30]:
val_img_pca_features = pca.transform(torch.tensor(raw_features[1]['Transformer Layer 7']).flatten(1).numpy())

# Use fitted linear regressions to predict the validation fMRI data
lh_fmri_val_pred = img_reg_lh.predict(val_img_pca_features)
rh_fmri_val_pred = img_reg_rh.predict(val_img_pca_features)

Model Performance and Statistical Evaluation¶

Now that we fitted a model and predicted the outcome variable of interest, let's evaluate how good the chosen feature space and modelling approach are. To that extent, we will compute the correlations between predicted and actual fMRI activity for the validation set (for both, left and right hemisphere).

In [31]:
# Empty correlation array of shape: (LH vertices)
lh_correlation = np.zeros(lh_fmri_val_pred.shape[1])
# Correlate each predicted LH vertex with the corresponding ground truth vertex
for v in tqdm(range(lh_fmri_val_pred.shape[1])):
    lh_correlation[v] = corr(lh_fmri_val_pred[:,v], lh_fmri_val[:,v])[0]

# Empty correlation array of shape: (RH vertices)
rh_correlation = np.zeros(rh_fmri_val_pred.shape[1])
# Correlate each predicted RH vertex with the corresponding ground truth vertex
for v in tqdm(range(rh_fmri_val_pred.shape[1])):
    rh_correlation[v] = corr(rh_fmri_val_pred[:,v], rh_fmri_val[:,v])[0]
100%|██████████| 19004/19004 [00:16<00:00, 1175.09it/s]
100%|██████████| 20544/20544 [00:17<00:00, 1172.23it/s]

A nice way to illustrate the resulting correlations is by visualizing the corresponding encoding accuracy of all vertices on a brain surface.

In [32]:
hemisphere = 'left' # ['left', 'right']

# Load the brain surface map of all vertices
fsaverage_all_vertices, _ = dirs.load_roi_masks("all-vertices", "lh" if hemisphere == "left" else "rh")

# Map the correlation results onto the brain surface map
fsaverage_correlation = np.zeros(len(fsaverage_all_vertices))
if hemisphere == 'left':
    fsaverage_correlation[np.where(fsaverage_all_vertices)[0]] = lh_correlation
elif hemisphere == 'right':
    fsaverage_correlation[np.where(fsaverage_all_vertices)[0]] = rh_correlation

# Create the interactive brain surface map
fsaverage = datasets.fetch_surf_fsaverage('fsaverage')
view = plotting.view_surf(
    surf_mesh=fsaverage['infl_'+hemisphere],
    surf_map=fsaverage_correlation,
    bg_map=fsaverage['sulc_'+hemisphere],
    threshold=1e-14,
    cmap='cold_hot',
    colorbar=True,
    title='Encoding accuracy, '+hemisphere+' hemisphere'
    )
view
Out[32]:

Finally, let's plot the achieved median correlation for each ROI in an overall bar plot, split into left and right hemisphere.

In [33]:
dirs.load_challenge_rois()

# Load the ROI classes mapping dictionaries
roi_name_maps = dirs.roi_name_maps

# Load the ROI brain surface maps
lh_challenge_rois, rh_challenge_rois = dirs.lh_challenge_rois, dirs.rh_challenge_rois

# Select the correlation results vertices of each ROI
roi_names = []
lh_roi_correlation = []
rh_roi_correlation = []
for r1 in range(len(lh_challenge_rois)):
    for r2 in roi_name_maps[r1].items():
        if r2[0] != 0: # zeros indicate to vertices falling outside the ROI of interest
            roi_names.append(r2[1])
            lh_roi_idx = np.where(lh_challenge_rois[r1] == r2[0])[0]
            rh_roi_idx = np.where(rh_challenge_rois[r1] == r2[0])[0]
            lh_roi_correlation.append(lh_correlation[lh_roi_idx])
            rh_roi_correlation.append(rh_correlation[rh_roi_idx])
roi_names.append('All vertices')
lh_roi_correlation.append(lh_correlation)
rh_roi_correlation.append(rh_correlation)

# Create the plot
lh_median_roi_correlation = [np.median(lh_roi_correlation[r]) for r in range(len(lh_roi_correlation))]
rh_median_roi_correlation = [np.median(rh_roi_correlation[r]) for r in range(len(rh_roi_correlation))]
plt.figure(figsize=(18,6))
x = np.arange(len(roi_names))
width = 0.30
plt.bar(x - width/2, lh_median_roi_correlation, width, label='Left Hemisphere')
plt.bar(x + width/2, rh_median_roi_correlation, width,
    label='Right Hemishpere')
plt.xlim(left=min(x)-.5, right=max(x)+.5)
plt.ylim(bottom=0, top=1)
plt.xlabel('ROIs')
plt.xticks(ticks=x, labels=roi_names, rotation=60)
plt.ylabel('Median Pearson\'s $r$')
plt.legend(frameon=True, loc=1)
plt.title(f"Model: CLIP Visual {list(raw_features[0].keys())[1]}");

Encoding accuracy CLIP Visual transformer¶

Overall, the CLIP Visual Transformer Layer 7 model has a moderate to high encoding accuracy over most image processing ROIs.

The layer 7 of the visual transformer is the final layer in the visual processing pipeline and is responsible for encoding the visual information into a high-dimensional feature vector that can be used for downstream tasks. Therefore, a moderate to high correlation with the activity in higher visual areas is intuitive. The features extracted by layer 7 of the CLIP visual transformer are particularly useful in predicting this activation because they capture the most abstract and high-level visual information. These features may be more directly relevant to the neural activity patterns in these areas, which are known to be sensitive to complex visual features like object recognition and scene perception.

Although being the final layer in the visual pipileine, the CLIP model's attention mechanisms and hierarchical processing may have allowed this layer to capture some of the low-level visual features. These features are relevant for predicting brain activity in early visual areas (V1, V2, V3, and V4), thus explaining moderate correlation between the predicted and observed fMRI activity in these areas.

As expected, the correlation is slightly lower for word-processing ROIs (OWFA, VWFA-1, VWFA-2). This specific neural mechanisms underlying word processing may be complex and not fully captured by the visual features extracted by the CLIP model. For example, in addition to visual features, word processing also involves linguistic and semantic knowledge, which may not be fully represented in the visual features extracted by the model. However, while the CLIP Visual Transformer Layer 7 model may not be optimized for predicting neural activity patterns in word-selective regions, the fact that there is still moderate degree of correlation suggests that it is able to capture some of the underlying neural mechanisms involved in word processing.

k-Fold Cross-Validation¶

The above execution of code is just a visual display of what is going under the hood. In order to properly evaluate the model, we need to run a k-fold cross-validation procedure. To that extent we wrote custom classes to run this procedure. Depending on the selected k-fold procedure, different requirements like creating a subject or feature extractor are needed. We ran multiple 8-fold cross validation procedures on the vision only, text only and a combination of features from the Text and Vision models (all on data of subject 1).

All k-Folds were run with seed number 5.

CLIP Vision Model, PCA200, Linear Regression ¶

Our first k-fold cross validation focused on all layers of the CLIPVision model. We ran 8 folds on image data of subject 1. For each iteration the current fold was selected as the validation set and the remaining folds were used as the training set.

The validation procedure for each fold was as follows:

  • Supplying the training set to a CLIPVision model.
  • Extracting the raw features for all layers (Embedding, Transformer 1 - 12, Final Layer).
  • Looping over folds:
    • Assing training and validation sets for images and fMRI activity.
    • Looping over each layer:
      • Fitting a PCA with 200 components using the raw training features.
      • Transforming the raw training features using the fitted PCA.
      • Fitting a linear regression, predicting training fMRI activity from the PCA transformed training features (for left and right hemispheres).
      • Transforming the raw validation features using the fitted PCA.
      • Using the fitted linear models to predict validation fMRI activity from the PCA transformed validation features (for left and right hemispheres).
      • Calculating correlations between predicted and actual fMRI activity for the validation set (for left and right hemispheres).
      • Calculating median correlations for each of the 36 challenge ROI (for left and right hemispheres).
    • Storing the median correlations for each hemisphere and each layer for the current fold.
  • Calculating mean median correlations accross all folds for each layer (for left and right hemispheres).
  • Plotting the mean median correlations for each layer for each ROI and select the layer with the highest mean median correlation averaged over all ROIs.

Results

  • Transformer layer 7 achieved the highest overall performance (left hemisphere = 0.472, right hemisphere = 0.484) and was selected for submission (submission score = 49.3178).

  • Interestingly the plot below suggests, that earlier layers are better at predicting the activity of the earlier visual cortex (V1, V2, V3, V4) and later layers are better at predicting the activity of the other ROIs.

In [34]:
plot_kfold_result("001")
KFoldProcedure Class ¶
In [ ]:
class KFoldSingleCLIPSingleSubject(KFoldProcedure):
    """A procedure that runs k-fold on all layers of a single CLIP model on a single subject."""
    def __init__(self, 
                 feature_extractor: CLIPFeatureExtractor,
                 subject: Subject, 
                 pca: PCA,
                 model_name: str = None,
                 description: str = None) -> None:
        assert isinstance(feature_extractor, CLIPFeatureExtractor), "feature_extractor must be an instance of CLIPFeatureExtractor"
        assert isinstance(subject, Subject), "subject must be an instance of Subject"
        self.feature_extractor = feature_extractor
        self.subject = subject
        self.pca = pca
        self.model_name = model_name
        self.description = description
        self.correlations = {}
        super().__init__()

    def prepare(self):
        # Extract raw features
        self.feature_extractor.extract_raw_features_from_model()
        self.raw_features = self.feature_extractor.feature_dict
        self.feature_extractor = None # free memory

        # Load challenge rois
        self.subject.load_challenge_rois()
        self.lh_challenge_rois = self.subject.lh_challenge_rois
        self.rh_challenge_rois = self.subject.rh_challenge_rois
        self.roi_name_maps = self.subject.roi_name_maps

        # Load neural data
        self.subject.load_neural_data()
        self.lh_fmri = self.subject.lh_fmri
        self.rh_fmri = self.subject.rh_fmri
        self.subject = None # free memory

        # Prepare correlation dict
        self.fold_correlations = {}

    def run(self, train_idxs: np.ndarray, val_idxs: np.ndarray) -> Dict[str, Dict[str, np.ndarray]]:
        # Loop over all layers          
        correlations = {}
        for layer in self.raw_features.keys():
            print(f"> {layer}")
            # Assign train and val features
            train_features = self.raw_features[layer][train_idxs]
            val_features = self.raw_features[layer][val_idxs]

            # Assign train and val fmri
            train_lh_fmri = self.lh_fmri[train_idxs]
            train_rh_fmri = self.rh_fmri[train_idxs]
            val_lh_fmri = self.lh_fmri[val_idxs]
            val_rh_fmri = self.rh_fmri[val_idxs]

            # Fit PCA models
            print(f"Fitting PCA model for {layer}...")
            train_pca_features = self.pca.fit_transform(torch.tensor(train_features).flatten(1).numpy())
            del train_features # free memory

            # Fit linear regression
            print(f"Fitting linear regression models for {layer}...")
            lh_lin_reg = LinearRegression().fit(train_pca_features, train_lh_fmri)
            rh_lin_reg = LinearRegression().fit(train_pca_features, train_rh_fmri)
            del train_pca_features, train_lh_fmri, train_rh_fmri # free memory

            # Transform validation features
            print(f"Transforming validation features for {layer}...")
            val_txt_pca_features = self.pca.transform(torch.tensor(val_features).flatten(1).numpy())
            del val_features # free memory

            # Predict validation set
            print(f"Predicting validation set for {layer}...")
            lh_val_pred = lh_lin_reg.predict(val_txt_pca_features)
            rh_val_pred = rh_lin_reg.predict(val_txt_pca_features)
            del val_txt_pca_features, lh_lin_reg, rh_lin_reg # free memory
            
            # Calculate correlations
            print(f"Calculating correlations for {layer}...\n")
            lh_correlation, rh_correlation = self.calculate_correlations(lh_val_pred, rh_val_pred, val_lh_fmri, val_rh_fmri)
            lh_median_roi_correlation, rh_median_roi_correlation = self.calculate_median_correlations(lh_correlation, rh_correlation)
            
            # Store correlations
            correlations[layer] = {"lh": lh_median_roi_correlation, "rh": rh_median_roi_correlation} 
        return correlations

    def return_idxs(self) -> np.ndarray:
        return np.arange(len(self.raw_features[list(self.raw_features.keys())[0]])) 
    
Executing the k-Fold Procedure¶
In [ ]:
# Raise exception to prevent accidental execution
raise Exception("This code is not meant to be run. It is only meant to be used as a reference for the code used to generate the submission.")

# Initializing all required objects
subject = Subject("subj01")
subject.create_dataloaders(processor=processor, batch_size=300)
feature_extractor = CLIPFeatureExtractor(
    idxs=[0,1,2,3,4,5,6,7,8,9,10,11,12], 
    last_hidden_layer=True, 
    model=vis_model, # here we use the vis_model
    dataloader=subject.train_img_dataloader) # and the img_dataloader

# Initialize the kfold procedure
kfold_procedure = KFoldSingleCLIPSingleSubject(
    feature_extractor=feature_extractor,
    subject=subject,
    pca=PCA(n_components=200),
    model_name="CLIP Vision",
    description="CLIP Vision Model, all layers, PCA 200, Single layer linear regression, 8-fold cross validation"
)

# Run KFold
KFold(folds=8, seed=5, procedure=kfold_procedure).run()
Additional Data Analysis¶

Because the cross validation plot indicated that earlier layers are better at predicting earlier ROIs, we checked which layer would have the highest mean median correlation averaged over early ROI (V1, V2, V3, V4) and which layer has the highest mean median correlation averaged over all other ROI.

The results indicate that Transformer Layer 4 is best at predicting activity in the early visual cortex, while layer 8 is the best layer of the rest of the cortex. Because of these findings we decided to cross validate another model against our strongest model so far.

In [ ]:
validation_folder = "validations"
validation = "001"

df = pd.read_csv(f"{validation_folder}/validation{validation}/results.csv")
df = df.drop(columns="model")

early_roi = ["V1v", "V1d", "V2v", "V2d", "V3v", "V3d"]
best_early_lh = df[(df.roi.isin(early_roi)) & (df.hemisphere == "lh")].groupby(["layer"]).mean(["correlations"]).sort_values(by='correlation', ascending=False).round(3).head(1).index.values[0]
best_early_lh_cor = df[(df.roi.isin(early_roi)) & (df.hemisphere == "lh")].groupby(["layer"]).mean(["correlations"]).sort_values(by='correlation', ascending=False).round(3).head(1).values[0][0]
best_early_rh = df[(df.roi.isin(early_roi)) & (df.hemisphere == "rh")].groupby(["layer"]).mean(["correlations"]).sort_values(by='correlation', ascending=False).round(3).head(1).index.values[0]
best_early_rh_cor = df[(df.roi.isin(early_roi)) & (df.hemisphere == "rh")].groupby(["layer"]).mean(["correlations"]).sort_values(by='correlation', ascending=False).round(3).head(1).values[0][0]
best_late_lh = df[~(df.roi.isin(early_roi)) & (df.hemisphere == "lh")].groupby(["layer"]).mean(["correlations"]).sort_values(by='correlation', ascending=False).round(3).head(1).index.values[0]
best_late_lh_cor = df[~(df.roi.isin(early_roi)) & (df.hemisphere == "lh")].groupby(["layer"]).mean(["correlations"]).sort_values(by='correlation', ascending=False).round(3).head(1).values[0][0]
best_late_rh = df[~(df.roi.isin(early_roi)) & (df.hemisphere == "rh")].groupby(["layer"]).mean(["correlations"]).sort_values(by='correlation', ascending=False).round(3).head(1).index.values[0]
best_late_rh_cor = df[~(df.roi.isin(early_roi)) & (df.hemisphere == "rh")].groupby(["layer"]).mean(["correlations"]).sort_values(by='correlation', ascending=False).round(3).head(1).values[0][0]
print(f"The best layer for early ROI on the left hemisphere is\t\t==>\t{best_early_lh} ({best_early_lh_cor})")
print(f"The best layer for early ROI on the right hemisphere is\t\t==>\t{best_early_rh} ({best_early_rh_cor})")
print(f"The best layer for later ROI on the left hemisphere is\t\t==>\t{best_late_lh} ({best_late_lh_cor})")
print(f"The best layer for later ROI on the right hemisphere is\t\t==>\t{best_late_rh} ({best_late_rh_cor})")

CLIP Text Model, PCA200, Linear Regression ¶

Our first k-fold cross validation focused on all layers of the CLIPText model. We ran 8 folds on caption data of subject 1. For each iteration the current fold was selected as the validation set and the remaining folds were used as the training set.

The validation procedure for each fold is the same as for the Vision Model. Therefore, all the steps remain the same with the exemption of the first step, where, naturally, the training set was supplied to a CLIPText model instead of a CLIPVision model.

Results

  • Transformer layer 11 achieved the highest overall performance (left hemisphere = 0.293, right hemisphere = 0.306) but was much worse than transformer layer 7 of the vision only model (left hemisphere = 0.472, right hemisphere = 0.484), therefore we did not submit this.

  • As opposed to the vision only model, the plot below indicates that later levels of the text only model are better to predict activity in all ROIs. Additionally, the text only model performed much worse on the early visual cortex (V1, V2, V3, V4) than the vision only model but performed comparably for the later layers.

Note: The KFoldProcedure Class is the same as for the Vision Model!

In [35]:
plot_kfold_result("002")
Executing the k-Fold Procedure¶
In [ ]:
# Raise exception to prevent accidental execution
raise Exception("This code is not meant to be run. It is only meant to be used as a reference for the code used to generate the submission.")

# Initializing all required objects
subject = Subject("subj01")
subject.create_dataloaders(processor=processor, batch_size=300)
feature_extractor = CLIPFeatureExtractor(
    idxs=[0,1,2,3,4,5,6,7,8,9,10,11,12], 
    last_hidden_layer=True, 
    model=txt_model, # here we use the txt_model 
    dataloader=subject.train_txt_dataloader) # and the txt_dataloader

# Initialize the kfold procedure
kfold_procedure = KFoldSingleCLIPSingleSubject(
    feature_extractor=feature_extractor,
    subject=subject,
    pca=PCA(n_components=200),
    model_name="CLIP Text",
    description="CLIP Text Model, all layers, PCA 200, Single layer linear regression, 8-fold cross validation"
)

# Run KFold
KFold(folds=8, seed=5, procedure=kfold_procedure).run()

CLIP Text + Vision Model, PCA200, Linear Regression ¶

Our third k-fold cross validation focused on combining all layers of the CLIPVision model and CLIPText model, in a way that we first fitted separate PCAs for CLIPVision and CLIPtest and then combined the features.

We ran 8 folds on caption and image data of subject 1. For each iteration the current fold was selected as the validation set and the remaining folds were used as the training set. Because of RAM limitations we ran the cross validation for the combined model in three batches (Embedding Layer to Transformer Layer 4, Transformer Layers 5 to 9, Transformer Layer 10 to Final Layer). Afterwards we manually combined the .csv files that contained the results to provide one plot including all layers.

The validation procedure for each fold was the same as in the previous two cases, with the adjustment for combining features from two models.

In summary, the caption training set was supplied to a CLIPText model and the image training set to a CLIPVision model in Step 1. When looping over folds, training and validation sets were asigned for images, COCO captions and fMRI activity. When looping over each layer, the steps were:

  • Fitting a PCA with 200 components using the raw training features (for text and images).
  • Transforming the raw training features using the fitted PCA (for text and images).
  • Fitting a linear regression, predicting training fMRI activity from the combination of PCA transformed image and caption training features (for left and right hemisphere).
  • Transforming the raw validation features using the fitted PCA (for text and images).
  • Using the fitted linear models to predict validation fMRI activity from the combination of pca transformed image and caption validation features (for left and right hemisphere).
  • Calculating correlations between predicted and actual fmri activity for the validation set (for left and right hemisphere).
  • Calculating median correlations for each of the 36 challenge ROI (for left and right hemisphere).
  • Storing the median correlations for each hemisphere and each layer for the current fold.

Finally, as for the Vision Model mean median correlations were calculated across all folds for each layer, and the layer with the highest mean median correlation averaged over all ROIs was selected by plotting.

Results

  • Similar to the vision only model, transformer layer 7 achieved the highest overall performance (left hemisphere = 0.468, right hemisphere = 0.48). But because it was slightly worse than the CLIPVision only model (left hemisphere = 0.472, right hemisphere = 0.484) we did not submit this.
In [36]:
plot_kfold_result("006")
KFoldProcedure Class¶
In [ ]:
class KFoldCombinedCLIPSingleSubject(KFoldProcedure):
    """A procedure that runs k-fold on all layers of a combination of the text and vision CLIP model on a single subject."""
    def __init__(self, 
                 subject: Subject, 
                 txt_pca: PCA,
                 img_pca: PCA,
                 model_name: str = None,
                 description: str = None,
                 layers: list = [],
                 last_hidden_state = False) -> None:
        self.subject = subject
        self.txt_pca = txt_pca
        self.img_pca = img_pca
        self.model_name = model_name
        self.description = description
        self.layers = layers.copy()
        self.layers2 = layers.copy()
        self.last_hidden_state = last_hidden_state
        self.correlations = {}
        super().__init__()

    def prepare(self):
        # Prepare data
        self.subject.create_dataloaders(processor, batch_size=400)
        train_img_dataloader = self.subject.train_img_dataloader
        train_txt_dataloader = self.subject.train_txt_dataloader

        # Extract raw features from text model
        feature_extractor = CLIPFeatureExtractor(idxs=self.layers, last_hidden_layer=self.last_hidden_state, model=txt_model, dataloader=train_txt_dataloader)
        feature_extractor.extract_raw_features_from_model()
        self.raw_txt_features = feature_extractor.feature_dict
        del feature_extractor # free memory

        # Extract raw features from vision model
        feature_extractor = CLIPFeatureExtractor(idxs=self.layers2, last_hidden_layer=self.last_hidden_state, model=vis_model, dataloader=train_img_dataloader)
        feature_extractor.extract_raw_features_from_model()
        self.raw_img_features = feature_extractor.feature_dict
        del feature_extractor # free memory

        # Load challenge rois
        self.subject.load_challenge_rois()
        self.lh_challenge_rois = self.subject.lh_challenge_rois
        self.rh_challenge_rois = self.subject.rh_challenge_rois
        self.roi_name_maps = self.subject.roi_name_maps

        # Load neural data
        self.subject.load_neural_data()
        self.lh_fmri = self.subject.lh_fmri
        self.rh_fmri = self.subject.rh_fmri
        self.subject = None # free memory

        # Prepare correlation dict
        self.fold_correlations = {}

    def run(self, train_idxs: np.ndarray, val_idxs: np.ndarray) -> Dict[str, Dict[str, np.ndarray]]:
        # Loop over all layers          
        correlations = {}
        for layer in self.raw_txt_features.keys():
            print(f"> {layer}")
            # Assign train and val text features
            train_txt_features = self.raw_txt_features[layer][train_idxs]
            val_txt_features = self.raw_txt_features[layer][val_idxs]
            # Assign train and val vision features
            train_img_features = self.raw_img_features[layer][train_idxs]
            val_img_features = self.raw_img_features[layer][val_idxs]

            # Assign train and val fmri
            train_lh_fmri = self.lh_fmri[train_idxs]
            train_rh_fmri = self.rh_fmri[train_idxs]
            val_lh_fmri = self.lh_fmri[val_idxs]
            val_rh_fmri = self.rh_fmri[val_idxs]

            # Fit PCA models
            print(f"Fitting PCA model for {layer}...")
            train_txt_pca_features = self.txt_pca.fit_transform(torch.tensor(train_txt_features).flatten(1).numpy())
            del train_txt_features # free memory
            train_img_pca_features = self.img_pca.fit_transform(torch.tensor(train_img_features).flatten(1).numpy())
            del train_img_features # free memory

            # Fit linear regression
            print(f"Fitting linear regression models for {layer}...")
            lh_lin_reg = LinearRegression().fit(np.hstack([train_txt_pca_features, train_img_pca_features]), train_lh_fmri)
            rh_lin_reg = LinearRegression().fit(np.hstack([train_txt_pca_features, train_img_pca_features]), train_rh_fmri)
            del train_txt_pca_features, train_img_pca_features, train_lh_fmri, train_rh_fmri # free memory

            # Transform validation features
            print(f"Transforming validation features for {layer}...")
            val_txt_pca_features = self.txt_pca.transform(torch.tensor(val_txt_features).flatten(1).numpy())
            del val_txt_features # free memory
            val_img_pca_features = self.img_pca.transform(torch.tensor(val_img_features).flatten(1).numpy())
            del val_img_features # free memory

            # Predict validation set
            print(f"Predicting validation set for {layer}...")
            lh_val_pred = lh_lin_reg.predict(np.hstack([val_txt_pca_features, val_img_pca_features]))
            rh_val_pred = rh_lin_reg.predict(np.hstack([val_txt_pca_features, val_img_pca_features]))
            del val_txt_pca_features, val_img_pca_features, lh_lin_reg, rh_lin_reg # free memory
            
            # Calculate correlations
            print(f"Calculating correlations for {layer}...\n")
            lh_correlation, rh_correlation = self.calculate_correlations(lh_val_pred, rh_val_pred, val_lh_fmri, val_rh_fmri)
            lh_median_roi_correlation, rh_median_roi_correlation = self.calculate_median_correlations(lh_correlation, rh_correlation)
            
            # Store correlations
            correlations[layer] = {"lh": lh_median_roi_correlation, "rh": rh_median_roi_correlation} 
        return correlations

    def return_idxs(self) -> np.ndarray:
        return np.arange(len(self.raw_txt_features[list(self.raw_txt_features.keys())[0]])) 
Executing the k-Fold Procedure¶
In [ ]:
# Raise exception to prevent accidental execution
raise Exception("This code is not meant to be run. It is only meant to be used as a reference for the code used to generate the submission.")

# Initialize required objects
subject = Subject("subj01")

# Initialize kfold procedure
kfold_procedure = KFoldCombinedCLIPSingleSubject(
    subject=subject, 
    txt_pca=PCA(n_components=200), 
    img_pca=PCA(n_components=200), 
    model_name="CLIP Vision + Text", 
    description="CLIP Vision + Text Model, Embedding layer and first 4 Transformer Layers, PCA 200, Single layer linear regression, 8-fold cross validation", 
    layers=[0,1,2,3,4], # only first 5 layers
    last_hidden_state=False
    )

# Run first kfold
KFold(folds=8, procedure=kfold_procedure).run()

# Initalize kfold procedure for second batch
kfold_procedure = KFoldCombinedCLIPSingleSubject(
    subject=subject, 
    txt_pca=PCA(n_components=200), 
    img_pca=PCA(n_components=200), 
    model_name="CLIP Vision + Text", 
    description="CLIP Vision + Text Model, Transformer layers 5 to 9, PCA 200, Single layer linear regression, 8-fold cross validation", 
    layers=[5,6,7,8,9], # the next 5 layers
    last_hidden_state=False
    )
# Run second kfold
KFold(folds=8, procedure=kfold_procedure).run()

# Initalize kfold procedure for third batch
kfold_procedure = KFoldCombinedCLIPSingleSubject(
    subject=subject, 
    txt_pca=PCA(n_components=200), 
    img_pca=PCA(n_components=200), 
    model_name="CLIP Vision + Text", 
    description="CLIP Vision + Text Model, Transformer layers 10 to 12 and Final Layer, PCA 200, Single layer linear regression, 8-fold cross validation", 
    layers=[10,11,12], # the next 4 layers
    last_hidden_state=True # and final layer
    )
# Run third kfold
KFold(folds=8, procedure=kfold_procedure).run()

CLIP Vision Model, Layer 4 + 8, Normal & Combined PCA200, Linear Regression

Given the results of our previous cross-validations, we decided to focus on the CLIPVision model to try to further improve its performance. Precisely, we decided to test how will different way of combining features from CLIPVision layers change the predictive performance. Thus, we compared combining the raw features of layers 4 and 8 before fitting a singular PCA and the approach of fitting two separate PCAs and then combining the features.

We ran 8 folds on image data of subject 1. For each iteration the current fold was selected as the validation set and the remaining folds were used as the training set.

We tested two ways of combining features from layers 4 and 8. First, we extracted the raw features of both layers, flattened, and concatenated them to fit a singular PCA with 200 components on the combined featureset. Second, we extracted the raw features of both layers, flattened them, fit two separate PCAs with 200 components each and then combined the two PCAs to supply 400 features to the linear regression. This way we could compare the performance of two different methods of reducing dimensionality.

The validation procedure for each fold was as follows:

Preparation

  • Supplying the training set to a CLIPVision model.
  • Extracting the raw features for Transformer Layers 4 and 8.

Looping over all folds

  • For each fold:

    • Assigning training and validation sets for images and fMRI activity.

      PCA(4+8)

    • Combine raw training features of layer 4 and 8.
    • Fit PCA with 200 components using the combined raw training features of layers 4 and 8.
    • Transforming the raw training features using the fitted PCA.
    • Fitting a linear regression, predicting training fmri activity from the pca transformed training features. (for left and right hemisphere)
    • Transforming the raw validation features using the fitted PCA.
    • Using the fitted linear models to predict validation fmri activity from the pca transformed validation features. (for left and right hemisphere)
    • Calculating correlations between predicted and actual fmri activity for the validation set. (for left and right hemisphere)
    • Calculating median correlations for each of the 36 challenge ROI. (for left and right hemisphere)
    • Storing the median correlations for each hemisphere for the combination of PCA(4+8) for the current fold.

      PCA(4) + PCA(8)

    • Fit PCA with 200 components using the raw training features of layer 4.
    • Fit PCA with 200 components using the raw training features of layer 8.
    • Transforming the raw training features using the fitted PCAs.
    • Combining the transformed training features of layer 4 and 8.
    • Fitting a linear regression, predicting training fmri activity from the PCA transformed training features. (for left and right hemisphere)
    • Transforming the raw validation features using the fitted PCAs.
    • Combining the transformed validation features of layer 4 and 8.
    • Using the fitted linear models to predict validation fmri activity from the PCA transformed validation features. (for left and right hemisphere)
    • Calculating correlations between predicted and actual fmri activity for the validation set. (for left and right hemisphere)
    • Calculating median correlations for each of the 36 challenge ROI. (for left and right hemisphere)
    • Storing the median correlations for each hemisphere for the combination of PCA(4) + PCA(8) for the current fold.

After the validation

  • Calculating mean median correlations accross all folds for each layer. (for left and right hemisphere)
  • Plotting the mean median correlations for each layer for each ROI and select the layer with the highest mean median correlation averaged over all ROIs.

Results

  • As can be seen in the plot below, combining the raw features of layers 4 and 8 before fitting a singular PCA (left hemisphere = 0.482, right hemisphere = 0.494) greatly outperformed the approach of fitting two separate PCAs and then combining them (left hemisphere = 0.229, right hemisphere = 0.223). Furthermore, this combination of layers 4 and 8 outperformed our previous validation attempt with layer 7 (left hemisphere = 0.472, right hemisphere = 0.484). Therefore, we decided to submit this model and received our highest submission score (50.541).
In [37]:
plot_kfold_result("007")
KFoldProcedure Class¶
In [ ]:
class KFoldVisPCA200L4L8(KFoldProcedure):
    """A procedure that runs k-fold on all layers of a single CLIP model on a single subject."""
    def __init__(self, 
                 subject: Subject, 
                 model_name: str = None,
                 description: str = None) -> None:
        assert isinstance(subject, Subject), "subject must be an instance of Subject"
        self.subject = subject
        self.model_name = model_name
        self.description = description
        self.correlations = {}
        super().__init__()

    def prepare(self):
        # Prepare data
        self.subject.create_dataloaders(processor, batch_size=400)
        train_dataloader = self.subject.train_img_dataloader
        
        # Extract raw features
        feature_extractor = CLIPFeatureExtractor(idxs=[4,8], last_hidden_layer=False, model=vis_model, dataloader=train_dataloader)
        feature_extractor.extract_raw_features_from_model()
        self.raw_features = feature_extractor.feature_dict
        del feature_extractor # free memory

        # Load challenge rois
        self.subject.load_challenge_rois()
        self.lh_challenge_rois = self.subject.lh_challenge_rois
        self.rh_challenge_rois = self.subject.rh_challenge_rois
        self.roi_name_maps = self.subject.roi_name_maps

        # Load neural data
        self.subject.load_neural_data()
        self.lh_fmri = self.subject.lh_fmri
        self.rh_fmri = self.subject.rh_fmri
        self.subject = None # free memory

        # Prepare correlation dict
        self.fold_correlations = {}

    def run(self, train_idxs: np.ndarray, val_idxs: np.ndarray) -> Dict[str, Dict[str, np.ndarray]]:        
        correlations = {}

        # Assign train and val fmri
        train_lh_fmri = self.lh_fmri[train_idxs]
        train_rh_fmri = self.rh_fmri[train_idxs]
        val_lh_fmri = self.lh_fmri[val_idxs]
        val_rh_fmri = self.rh_fmri[val_idxs]

        ####### Layers PCA (4 + 8) #######
        # Assign train and val features
        train_features = self.raw_features["Transformer Layer 4"][train_idxs]
        train_features = np.hstack([train_features, self.raw_features["Transformer Layer 8"][train_idxs]])
        val_features = self.raw_features["Transformer Layer 4"][val_idxs]
        val_features = np.hstack([val_features, self.raw_features["Transformer Layer 8"][val_idxs]])

        # Fit PCA model
        pca = PCA(n_components=200)
        train_pca_features = pca.fit_transform(torch.tensor(train_features).flatten(1).numpy())
        del train_features # free memory

        # Fit linear regressions
        lh_lin_reg = LinearRegression().fit(train_pca_features, train_lh_fmri)
        rh_lin_reg = LinearRegression().fit(train_pca_features, train_rh_fmri)
        del train_pca_features # free memory

        # Transform validation features
        val_txt_pca_features = pca.transform(torch.tensor(val_features).flatten(1).numpy())
        del val_features, pca # free memory

        # Predict validation dict
        lh_val_pred = lh_lin_reg.predict(val_txt_pca_features)
        rh_val_pred = rh_lin_reg.predict(val_txt_pca_features)
        del val_txt_pca_features, lh_lin_reg, rh_lin_reg # free memory

        # Calculate correlations
        lh_correlation, rh_correlation = self.calculate_correlations(lh_val_pred, rh_val_pred, val_lh_fmri, val_rh_fmri)
        lh_median_roi_correlation, rh_median_roi_correlation = self.calculate_median_correlations(lh_correlation, rh_correlation)

        # Store correlations
        correlations["Transformer Layer PCA(4 + 8)"] = {"lh": lh_median_roi_correlation, "rh": rh_median_roi_correlation} 

        ####### Layers PCA(4) + PCA(8) #######
        # Assign train and val features
        train_features_4 = self.raw_features["Transformer Layer 4"][train_idxs]
        train_features_8 = self.raw_features["Transformer Layer 8"][train_idxs]
        val_features_4 = self.raw_features["Transformer Layer 4"][val_idxs]
        val_features_8 = self.raw_features["Transformer Layer 8"][val_idxs]

        # Fit PCA models
        pca_4 = PCA(n_components=200)
        train_pca_features_4 = pca_4.fit_transform(torch.tensor(train_features_4).flatten(1).numpy())
        del train_features_4 # free memory
        pca_8 = PCA(n_components=200)
        train_pca_features_8 = pca_8.fit_transform(torch.tensor(train_features_8).flatten(1).numpy())
        del train_features_8 # free memory
        train_pca_features = np.hstack([train_pca_features_4, train_pca_features_8])
        del train_pca_features_4, train_pca_features_8 # free memory

        # Fit linear regressions
        lh_lin_reg = LinearRegression().fit(train_pca_features, train_lh_fmri)
        rh_lin_reg = LinearRegression().fit(train_pca_features, train_rh_fmri)
        del train_pca_features # free memory

        # Transform validation features
        val_txt_pca_features_4 = pca_4.transform(torch.tensor(val_features_4).flatten(1).numpy())
        del val_features_4 # free memory
        val_txt_pca_features_8 = pca_4.transform(torch.tensor(val_features_8).flatten(1).numpy())
        del val_features_8 # free memory
        val_txt_pca_features = np.hstack([val_txt_pca_features_4, val_txt_pca_features_8])
        del val_txt_pca_features_4, val_txt_pca_features_8, pca_4, pca_8 # free memory

        # Predict validation dict
        lh_val_pred = lh_lin_reg.predict(val_txt_pca_features)
        rh_val_pred = rh_lin_reg.predict(val_txt_pca_features)
        del val_txt_pca_features, lh_lin_reg, rh_lin_reg # free memory

        # Calculate correlations
        lh_correlation, rh_correlation = self.calculate_correlations(lh_val_pred, rh_val_pred, val_lh_fmri, val_rh_fmri)
        lh_median_roi_correlation, rh_median_roi_correlation = self.calculate_median_correlations(lh_correlation, rh_correlation)

        # Store correlations
        correlations["Transformer Layer PCA(4) + PCA(8)"] = {"lh": lh_median_roi_correlation, "rh": rh_median_roi_correlation} 

        return correlations

    def return_idxs(self) -> np.ndarray:
        return np.arange(len(self.raw_features[list(self.raw_features.keys())[0]])) 
    
Executing the k-Fold Procedure¶
In [ ]:
# Raise exception to prevent accidental execution
raise Exception("This code is not meant to be run. It is only meant to be used as a reference for the code used to generate the submission.")

# Initialize required objects
subject = Subject("subj01")

# Initialize kfold procedure
kfold_procedure = KFoldVisPCA200L4L8(
    subject=subject, 
    model_name="CLIP Vision", 
    description="CLIP Vision, Layer 4+8, PCA 200, Linear regression, 8-fold cross validation", 
    )

# Run kfold
KFold(folds=8, procedure=kfold_procedure).run()

CLIP Text + Vision Model, Vision Layer 4 + 8 & Text Layer 11, Combined PCA200, Linear Regression¶

Because the results of our second (CLIP Text Model, PCA200) cross-validation showed that Transformer layer 11 achieved the highest performance (left hemisphere = 0.293, right hemisphere = 0.306), we decided to combine the CLIPVision model (PCA200, Layer 4+8) and the CLIPText model (PCA200, Layer11).

We ran 8 folds on image data of subject 1. For each iteration the current fold was selected as the validation set and the remaining folds were used as the training set.

The validation procedure for each fold was similar as in CLIPVision Model, Layers 4 + 8, Normal & Combined PCA200, Linear Regression.

Preparation

  • Supplying the training set to a CLIPVision model and CLIPText model.
  • Extracting the raw features for Transformer Layers 4, 8 from the CLIPVision model and extracting the raw features for Transformer Layer 11 from the CLIPText model.

Looping over all folds

  • Combine raw training features of CLIPVision Layer 4 and 8, and CLIPText model Layer 11.
  • Fit PCA with 200 components using the combined raw training features of CLIPVision model Layers 4 and 8, and CLIPText model Layer 11.
  • Transforming the raw training features using the fitted PCA.
  • Fitting a linear regression, predicting training fMRI activity from the PCA transformed training features (for left and right hemispheres).
  • Transforming the raw validation features using the fitted PCA.
  • Using the fitted linear models to predict validation fMRI activity from the PCA transformed validation features (for left and right hemispheres).
  • Calculating correlations between predicted and actual fMRI activity for the validation set (for left and right hemispheres).
  • Calculating median correlations for each of the 36 challenge ROI (for left and right hemispheres).
  • Storing the median correlations for each hemisphere for the combination of layers 4 and 8 for the current fold.

After the validation

  • Calculating mean median correlations accross all folds for each layer (for left and right hemispheres).
  • Plotting the mean median correlations for each layer for each ROI and select the layer with the highest mean median correlation averaged over all ROIs.

Results

  • The combined model achieved a lower performance (left hemisphere = 0.444, right hemisphere = 0.459), then combining only layers of CLIPVision model. Therefore, we did not submit this solution.
In [38]:
plot_kfold_result("008")
KFoldProcedure Class¶
In [ ]:
class KFoldVisL4L8TextL11PCA200(KFoldProcedure):
    """Combining layers 4 and 8 of CLIP Vision and layer 11 of CLIP Text, PCA 200, Linear regression, 8-fold cross validation"""
    def __init__(self, 
                 subject: Subject, 
                 model_name: str = None,
                 description: str = None) -> None:
        assert isinstance(subject, Subject), "subject must be an instance of Subject"
        self.subject = subject
        self.model_name = model_name
        self.description = description
        self.correlations = {}
        super().__init__()

    def prepare(self):
        # Prepare data
        subject.create_dataloaders(processor, batch_size = 300)
        train_img_dataloader = subject.train_img_dataloader
        train_txt_dataloader = subject.train_txt_dataloader
        
        # Extract raw img features
        feature_extractor = CLIPFeatureExtractor(idxs=[4,8], last_hidden_layer=False, model=vis_model, dataloader=train_img_dataloader)
        feature_extractor.extract_raw_features_from_model()
        self.raw_img_features = feature_extractor.feature_dict
        del feature_extractor # free memory

        # Extract raw txt features
        feature_extractor = CLIPFeatureExtractor(idxs=[11], last_hidden_layer=False, model=txt_model, dataloader=train_txt_dataloader)
        feature_extractor.extract_raw_features_from_model()
        self.raw_txt_features = feature_extractor.feature_dict
        del feature_extractor # free memory

        # Load challenge rois
        self.subject.load_challenge_rois()
        self.lh_challenge_rois = self.subject.lh_challenge_rois
        self.rh_challenge_rois = self.subject.rh_challenge_rois
        self.roi_name_maps = self.subject.roi_name_maps

        # Load neural data
        self.subject.load_neural_data()
        self.lh_fmri = self.subject.lh_fmri
        self.rh_fmri = self.subject.rh_fmri
        self.subject = None # free memory

        # Prepare correlation dict
        self.fold_correlations = {}

    def run(self, train_idxs: np.ndarray, val_idxs: np.ndarray) -> Dict[str, Dict[str, np.ndarray]]:        
        correlations = {}

        # Assign train and val fmri
        train_lh_fmri = self.lh_fmri[train_idxs]
        train_rh_fmri = self.rh_fmri[train_idxs]
        val_lh_fmri = self.lh_fmri[val_idxs]
        val_rh_fmri = self.rh_fmri[val_idxs]

        ####### Layers PCA (4 + 8) #######
        # Assign train and val features
        train_features = np.concatenate([torch.tensor(self.raw_img_features["Transformer Layer 4"][train_idxs]).flatten(1).numpy(), 
                                         torch.tensor(self.raw_img_features["Transformer Layer 8"][train_idxs]).flatten(1).numpy(),
                                         torch.tensor(self.raw_txt_features["Transformer Layer 11"][train_idxs]).flatten(1).numpy()], axis=1)
        val_features = np.concatenate([torch.tensor(self.raw_img_features["Transformer Layer 4"][val_idxs]).flatten(1).numpy(), 
                                       torch.tensor(self.raw_img_features["Transformer Layer 8"][val_idxs]).flatten(1).numpy(),
                                       torch.tensor(self.raw_txt_features["Transformer Layer 11"][val_idxs]).flatten(1).numpy()], axis=1)

        # Fit PCA model
        pca = PCA(n_components=200)
        train_pca_features = pca.fit_transform(torch.tensor(train_features).flatten(1).numpy())
        del train_features # free memory

        # Fit linear regressions
        lh_lin_reg = LinearRegression().fit(train_pca_features, train_lh_fmri)
        rh_lin_reg = LinearRegression().fit(train_pca_features, train_rh_fmri)
        del train_pca_features # free memory

        # Transform validation features
        val_pca_features = pca.transform(torch.tensor(val_features).flatten(1).numpy())
        del val_features, pca # free memory

        # Predict validation dict
        lh_val_pred = lh_lin_reg.predict(val_pca_features)
        rh_val_pred = rh_lin_reg.predict(val_pca_features)
        del val_pca_features, lh_lin_reg, rh_lin_reg # free memory

        # Calculate correlations
        lh_correlation, rh_correlation = self.calculate_correlations(lh_val_pred, rh_val_pred, val_lh_fmri, val_rh_fmri)
        lh_median_roi_correlation, rh_median_roi_correlation = self.calculate_median_correlations(lh_correlation, rh_correlation)

        # Store correlations
        correlations["PCA(Vis4 + Vis8 + Txt11)"] = {"lh": lh_median_roi_correlation, "rh": rh_median_roi_correlation} 

        return correlations

    def return_idxs(self) -> np.ndarray:
        return np.arange(len(self.raw_img_features[list(self.raw_img_features.keys())[0]])) 
    
Executing the k-Fold Procedure¶
In [ ]:
# Raise exception to prevent accidental execution
raise Exception("This code is not meant to be run. It is only meant to be used as a reference for the code used to generate the submission.")

# Initialize required objects
subject = Subject("subj01")

# Initialize kfold procedure
kfold_procedure = KFoldVisL4L8TextL11PCA200(
    subject=subject, 
    model_name="CLIP Vision + Text Model (V4, V8, T11)", 
    description="CLIP Vision + Text Model, Vision Layer 4+8 & Text Layer 11, Combiend PCA 200, Linear regression, 8-fold cross validation", 
    )

# Run kfold
KFold(folds=8, procedure=kfold_procedure).run()

CLIP Text + Vision Model, Combined PCA200, Linear Regression¶

Finally, we decided to test how will different way of combining features from CLIPVision and CLIPText models change the performance when including all layers. The validation procedure for each fold was similar as in CLIP Text + Vision Model, PCA200, Linear Regression, with the exemption of running one PCA on combined text and vision layer features, instead of running a separate PCA on each layer features. Therefore, we did the same comparison as in CLIP Vision Model, Layer 4 + 8, Normal & Combined PCA200, Linear Regression, but now including all layers.

We ran 8 folds on image data of subject 1. For each iteration the current fold was selected as the validation set and the remaining folds were used as the training set.

Preparation

  • Supplying the caption training set to a CLIPText model and the image training set to a CLIPVision model
  • Extracting the raw features for all layers (Embedding, Transformer 1 - 12, Final Layer) (for text and images).

Looping over all folds

  • Combine raw training features of CLIPVision model and CLIPText model.
  • Fit PCA with 200 components using the combined raw training features of CLIPVision model and CLIPText model.
  • Transforming the raw training features using the fitted PCA.
  • Fitting a linear regression, predicting training fMRI activity from the PCA transformed training features (for left and right hemispheres).
  • Transforming the raw validation features using the fitted PCA.
  • Using the fitted linear models to predict validation fMRI activity from the PCA transformed validation features (for left and right hemispheres).
  • Calculating correlations between predicted and actual fMRI activity for the validation set (for left and right hemispheres).
  • Calculating median correlations for each of the 36 challenge ROI (for left and right hemispheres).
  • Storing the median correlations for each hemisphere and each layer for the current fold.

After the validation

  • Calculating mean median correlations accross all folds for each layer (for left and right hemispheres).
  • Plotting the mean median correlations for each layer for each ROI and select the layer with the highest mean median correlation averaged over all ROIs.

Results

  • Running the PCA on combined text and vision features outperformed the solution with running a separate PCA on text and vision features separately (left hemisphere = 0.469, right hemisphere = 0.482). However, as the result was still worse than our best model (CLIPVision model, Combined Layer 4 + 8), we did not submit this model.
In [39]:
plot_kfold_result("012")
KFoldProcedure Class¶
In [ ]:
class KFoldCombinedCLIPSingleSubjectComb(KFoldProcedure):
    """A procedure that runs k-fold on all layers of a combination of the text and vision CLIP model on a single subject."""
    def __init__(self, 
                 subject: Subject, 
                 model_name: str = None,
                 description: str = None,
                 layers: list = [],
                 last_hidden_state = False) -> None:
        self.subject = subject
        self.model_name = model_name
        self.description = description
        self.layers = layers.copy()
        self.layers2 = layers.copy()
        self.last_hidden_state = last_hidden_state
        self.correlations = {}
        super().__init__()

    def prepare(self):
        # Prepare data
        self.subject.create_dataloaders(processor, batch_size=400)
        train_img_dataloader = self.subject.train_img_dataloader
        train_txt_dataloader = self.subject.train_txt_dataloader

        # Extract raw features from text model
        feature_extractor = CLIPFeatureExtractor(idxs=self.layers, last_hidden_layer=self.last_hidden_state, model=txt_model, dataloader=train_txt_dataloader)
        feature_extractor.extract_raw_features_from_model()
        self.raw_txt_features = feature_extractor.feature_dict
        del feature_extractor # free memory

        # Extract raw features from vision model
        feature_extractor = CLIPFeatureExtractor(idxs=self.layers2, last_hidden_layer=self.last_hidden_state, model=vis_model, dataloader=train_img_dataloader)
        feature_extractor.extract_raw_features_from_model()
        self.raw_img_features = feature_extractor.feature_dict
        del feature_extractor # free memory

        # Load challenge rois
        self.subject.load_challenge_rois()
        self.lh_challenge_rois = self.subject.lh_challenge_rois
        self.rh_challenge_rois = self.subject.rh_challenge_rois
        self.roi_name_maps = self.subject.roi_name_maps

        # Load neural data
        self.subject.load_neural_data()
        self.lh_fmri = self.subject.lh_fmri
        self.rh_fmri = self.subject.rh_fmri
        self.subject = None # free memory

        # Prepare correlation dict
        self.fold_correlations = {}

    def run(self, train_idxs: np.ndarray, val_idxs: np.ndarray) -> Dict[str, Dict[str, np.ndarray]]:
        # Loop over all layers          
        correlations = {}
        for layer in self.raw_txt_features.keys():
            print(f"> {layer}")
            # Assign train and val text features
            train_txt_features = self.raw_txt_features[layer][train_idxs]
            val_txt_features = self.raw_txt_features[layer][val_idxs]
            # Assign train and val vision features
            train_img_features = self.raw_img_features[layer][train_idxs]
            val_img_features = self.raw_img_features[layer][val_idxs]
            # Combine features
            train_features = np.concatenate([torch.tensor(train_txt_features).flatten(1).numpy(), 
                                             torch.tensor(train_img_features).flatten(1).numpy()], axis=1)
            val_features = np.concatenate([torch.tensor(val_txt_features).flatten(1).numpy(), 
                                           torch.tensor(val_img_features).flatten(1).numpy()], axis=1)
            del train_txt_features, train_img_features, val_txt_features, val_img_features

            # Assign train and val fmri
            train_lh_fmri = self.lh_fmri[train_idxs]
            train_rh_fmri = self.rh_fmri[train_idxs]
            val_lh_fmri = self.lh_fmri[val_idxs]
            val_rh_fmri = self.rh_fmri[val_idxs]

            # Fit PCA models
            print(f"Fitting PCA model for {layer}...")
            pca = PCA(n_components=200)
            train_pca_features = pca.fit_transform(torch.tensor(train_features).flatten(1).numpy())
            del train_features # free memory

            # Fit linear regression
            print(f"Fitting linear regression models for {layer}...")
            lh_lin_reg = LinearRegression().fit(train_pca_features, train_lh_fmri)
            rh_lin_reg = LinearRegression().fit(train_pca_features, train_rh_fmri)
            del train_pca_features, train_lh_fmri, train_rh_fmri # free memory

            # Transform validation features
            print(f"Transforming validation features for {layer}...")
            val_pca_features = pca.transform(torch.tensor(val_features).flatten(1).numpy())
            del val_features, pca # free memory

            # Predict validation set
            print(f"Predicting validation set for {layer}...")
            lh_val_pred = lh_lin_reg.predict(val_pca_features)
            rh_val_pred = rh_lin_reg.predict(val_pca_features)
            del val_pca_features, lh_lin_reg, rh_lin_reg # free memory
            
            # Calculate correlations
            print(f"Calculating correlations for {layer}...\n")
            lh_correlation, rh_correlation = self.calculate_correlations(lh_val_pred, rh_val_pred, val_lh_fmri, val_rh_fmri)
            lh_median_roi_correlation, rh_median_roi_correlation = self.calculate_median_correlations(lh_correlation, rh_correlation)
            
            # Store correlations
            correlations[layer] = {"lh": lh_median_roi_correlation, "rh": rh_median_roi_correlation} 
        return correlations

    def return_idxs(self) -> np.ndarray:
        return np.arange(len(self.raw_txt_features[list(self.raw_txt_features.keys())[0]])) 
Executing the k-Fold Procedure¶
In [ ]:
# Raise exception to prevent accidental execution
raise Exception("This code is not meant to be run. It is only meant to be used as a reference for the code used to generate the submission.")

# Initialize required objects
subject = Subject("subj01")

# Initialize kfold procedure
kfold_procedure = KFoldCombinedCLIPSingleSubjectComb(
    subject=subject,
    model_name="CLIP Vision + Text (Combined PCA)", 
    description="CLIP Vision + Text Model, Embedding layer and first 4 Transformer Layers, Combined PCA 200, Single layer linear regression, 8-fold cross validation", 
    layers=[0,1,2,3,4], # only first 5 layers
    last_hidden_state=False
    )
# Run first kfold
KFold(folds=8, procedure=kfold_procedure).run()

# Initalize kfold procedure for second batch
kfold_procedure = KFoldCombinedCLIPSingleSubjectComb(
    subject=subject,
    model_name="CLIP Vision + Text (Combined PCA)", 
    description="CLIP Vision + Text Model, Transformer layers 5 to 9, Combined PCA 200, Single layer linear regression, 8-fold cross validation", 
    layers=[5,6,7,8,9], # the next 5 layers
    last_hidden_state=False
    )
# Run second kfold
KFold(folds=8, procedure=kfold_procedure).run()

# Initalize kfold procedure for third batch
kfold_procedure = KFoldCombinedCLIPSingleSubjectComb(
    subject=subject,
    model_name="CLIP Vision + Text (Combined PCA)", 
    description="CLIP Vision + Text Model, Transformer layers 10 to 12 and Final Layer, Combined PCA 200, Single layer linear regression, 8-fold cross validation", 
    layers=[10,11,12], # the next 4 layers
    last_hidden_state=True # and final layer
    )
# Run third kfold
KFold(folds=8, procedure=kfold_procedure).run()

Discussion & Conclusions¶

In this study, CLIP is used to predict recorded brain activation across the whole brain of participants viewing complex natural visual scenes. Three CLIP-based approaches are tested: a Text model, a Vision model, and a combination of features from the Text and Vision models. First, we investigated the data. The neural RDM showed clustering among trials, indicating (dis-)similarity across brain activations for different trials. Additionally, the noise ceiling indicated a small amount noise, namely the variance that is not reliably shared across subjects.

When investigating the three approached, the Text model is the worst performing model to predict brain activation. Interestingly, the early visual ROI show low correlations in the text model, where later layers are better approximated by this model. This is intuitive, as the early visual layers in the brain are important for visual processing, therefore the Text model did not manage to predict these visual activation patterns. The Vision model however, showed high correlations for early visual layers. Unexpectedly, the Vision model also outperformed the Text model on later ROIs. This could be explained by the importance of language representations in visual processing. For instance in object recognition, one needs language categories to categorize the perception (Lupyan et al., 2020). These language processes are also present during visual processing, not only during language processing. To encorporate this language feature in visual perception in our model, we combined the text and vision model. However, contrary to our initial expectations and findings from the literature (Marjiet et al., 2023), this combination of features from the Text and Vision models performed equally or worse than the Vision model for the ROIs.

Because the Vision model outperformed the other two approaches, further testing was done on specific layers of the Vision model. We decided to use only a subset of layers, since when utilizing all layers, the features extracted by each layer may not be equally relevant to the task at hand. Moreover, using all layers may introduce noise or redundancy, which can reduce the effectiveness of the features. Combining layers 4 and 8 showed high correlations and predicted the fMRI activation better than the two seperate layers. This was our final submission for the Alganauts Challenge 2023. This could be because layer 4 is known to extract low-level visual features such as edges and textures, while layer 8 extracts high-level semantic features such as object categories and scene types. By combining the two layers, we may have a more comprehensive representation of the visual content of the images that captures both low-level visual details and high-level semantic information.

One possible explanation for the unexpected findings of lower performance of the combination of features from the Text and Vision models compared to the Vision model alone is that the language information provided through the COCO captions did not correspond to the stimuli from the Challenge data - which were cropped COCO images. This could mean that the language captions did no longer precisely describe what is displayed on the image. Therefore the model could not infer the relationship between the text and images by leveraging the cross-modal understanding and the advantage of adding text to the model was lost. This should be investigated to rule out this possible confound. However, if the cropping did not influence the precision of the captions, another explanation could be that combining different features of the same multi-modal model might result in a loss of information from both modalities when optimizing for a joint embedding, as recently suggested by Marjieh et al. (2023). Therefore, we propose that the future attempts to explain neural activity when viewing natural scenes implement stacking the representations of distinct text and vision models in contrary to the representations of the same multi-modal model.

To conclude, we showed that combining only relevant layers of CLIPVision that capture both low-level and high-level visual features can provide a more effective representation of the visual content of the images, in contrary to CLIPText or the combination of layers of CLIPVision and CLIPText models.

Challenge Submission: Models Submitted For Algonauts Challenge¶

SubmissionProcedure & CreateSubmission Classes¶

Below are the models we submitted to the Algonauts 2023 competition.

We wrote two classes, CreateSubmission and SubmissionProcedure, to make the process of generating predictions for submission as straight forward as possible. The CreateSubmission class creates folders for submission and loops over all subjects to apply the specific SubmissionProcedure.

In [ ]:
class SubmissionProcedure: 
    """Used to create a submission procedure that is executed for each subject in the CreateSubmission class."""

    def run(self):
        raise NotImplementedError

class CreateSubmission:
    """Create a new challenge submission."""
    def __init__(self, 
                 subjects : List[Subject],
                 procedure: SubmissionProcedure):
        self.subjects = subjects
        self.procedure = procedure

    def create_submission_folder(self) -> None:
        # create new submission folder with newest name
        submissions = glob.glob(f"submissions/submission*")
        if len(submissions) == 0:
            # create first submission folder
            self.folder_name = "submission001"
            os.mkdir(f"submissions/{self.folder_name}")
        else:
            # create next submission folder
            last_submission = sorted(submissions)[-1]
            last_submission_number = int(last_submission.split("/")[-1].split("submission")[-1])
            next_submission_number = last_submission_number + 1
            self.folder_name = f"submission{str(next_submission_number).zfill(3)}"
            os.mkdir(f"submissions/{self.folder_name}")
        # Write text file with model description
        with open(f"submissions/{self.folder_name}/info.txt", "w") as f:
            f.write(self.procedure.description)
        # create a folder for each subject
        for subject in self.subjects:
            os.mkdir(f"submissions/{self.folder_name}/{subject.subject}")

    def save_predictions(self, subject: Subject, lh_predictions: np.ndarray, rh_predictions: np.ndarray) -> None:
        """Save predictions for a subject."""
        lh_predictions = lh_predictions.astype(np.float32)
        rh_predictions = rh_predictions.astype(np.float32)
        save_path = f"submissions/{self.folder_name}/{subject.subject}"
        # Save predictions
        np.save(f"{save_path}/lh_pred_test.npy", lh_predictions)
        np.save(f"{save_path}/rh_pred_test.npy", rh_predictions)

    def run(self):
        self.create_submission_folder()
        for subject in self.subjects:
            print(f"############################")
            print(f"# Subject: {subject.subject}")
            print(f"############################")
            lh_predictions, rh_predictions = self.procedure.run(subject)
            self.save_predictions(subject, lh_predictions, rh_predictions)
            print(f"\n")

First and Second Submission - Vision Only Final Layer / Text Only Final Layer, PCA 100¶

To familiarize ourselves with the CodaLab submission system we selected our first two models without cross validation and used a 90%-10% train-validation split to guide our decision instead.

We ended up submitting two linear models based on PCA (100 components) transformed features of the final layer of the CLIPVision model (submission score = 43.262) and CLIPText model (submission score = 34.210).

In [ ]:
class CLIPVisionFinalLayerPCA100LinearRegression(SubmissionProcedure):
    def __init__(self):
        self.description = "CLIP Vision Model, Final Layer, PCA 100, Linear Regression, First test submission."

    def run(self, subject: Subject) -> np.ndarray:
        """Run the model on a subject."""
        # Prepare data
        subject.create_dataloaders(processor=processor, batch_size=300)
        subject.load_neural_data()
        train_img_dataloader = subject.train_img_dataloader
        test_img_dataloader = subject.test_img_dataloader
        lh_fmri = subject.lh_fmri
        rh_fmri = subject.rh_fmri
        del subject # free up memory

        # Prepare feature extractor
        train_feature_extractor = CLIPFeatureExtractor(idxs=[], last_hidden_layer=True, model=vis_model, dataloader=train_img_dataloader)
        test_feature_extractor = CLIPFeatureExtractor(idxs=[], last_hidden_layer=True, model=vis_model, dataloader=test_img_dataloader)

        # Extract features
        train_feature_extractor.extract_raw_features_from_model()
        raw_train_features = train_feature_extractor.feature_dict["Final Layer"]
        del train_feature_extractor

        # Fit PCA
        pca = PCA(n_components=100)
        pca_transformed_train_features = pca.fit_transform(torch.tensor(raw_train_features).flatten(1).numpy())
        del raw_train_features

        # Fit linear regression
        lh_lin_reg = LinearRegression().fit(pca_transformed_train_features, lh_fmri)
        rh_lin_reg = LinearRegression().fit(pca_transformed_train_features, rh_fmri)
        del pca_transformed_train_features

        # Extract test features
        test_feature_extractor.extract_raw_features_from_model()
        raw_test_features = test_feature_extractor.feature_dict["Final Layer"]
        del test_feature_extractor

        # Transform test features
        pca_transformed_test_features = pca.transform(torch.tensor(raw_test_features).flatten(1).numpy())
        del raw_test_features
        
        # Predict
        lh_predictions = lh_lin_reg.predict(pca_transformed_test_features)
        rh_predictions = rh_lin_reg.predict(pca_transformed_test_features)
        return lh_predictions, rh_predictions

class CLIPTextFinalLayerPCA100LinearRegression(SubmissionProcedure):
    def __init__(self):
        self.description = "CLIP Text Model, Final Layer, PCA 100, Linear Regression, Second test submission."

    def run(self, subject: Subject) -> np.ndarray:
        """Run the model on a subject."""
        # Prepare data
        subject.create_dataloaders(processor=processor, batch_size=300)
        subject.load_neural_data()
        train_txt_dataloader = subject.train_txt_dataloader
        test_txt_dataloader = subject.test_txt_dataloader
        lh_fmri = subject.lh_fmri
        rh_fmri = subject.rh_fmri
        del subject # free up memory

        # Prepare feature extractor
        train_feature_extractor = CLIPFeatureExtractor(idxs=[], last_hidden_layer=True, model=txt_model, dataloader=train_txt_dataloader)
        test_feature_extractor = CLIPFeatureExtractor(idxs=[], last_hidden_layer=True, model=txt_model, dataloader=test_txt_dataloader)

        # Extract features
        train_feature_extractor.extract_raw_features_from_model()
        raw_train_features = train_feature_extractor.feature_dict["Final Layer"]
        del train_feature_extractor

        # Fit PCA
        pca = PCA(n_components=100)
        pca_transformed_train_features = pca.fit_transform(torch.tensor(raw_train_features).flatten(1).numpy())
        del raw_train_features

        # Fit linear regression
        lh_lin_reg = LinearRegression().fit(pca_transformed_train_features, lh_fmri)
        rh_lin_reg = LinearRegression().fit(pca_transformed_train_features, rh_fmri)
        del pca_transformed_train_features

        # Extract test features
        test_feature_extractor.extract_raw_features_from_model()
        raw_test_features = test_feature_extractor.feature_dict["Final Layer"]
        del test_feature_extractor

        # Transform test features
        pca_transformed_test_features = pca.transform(torch.tensor(raw_test_features).flatten(1).numpy())
        del raw_test_features
        
        # Predict
        lh_predictions = lh_lin_reg.predict(pca_transformed_test_features)
        rh_predictions = rh_lin_reg.predict(pca_transformed_test_features)
        return lh_predictions, rh_predictions

Executing First and Second Submission¶

In [ ]:
raise Exception("This code is not meant to be run. It is only meant to be used as a reference for the code used to generate the submission.")

# First test submission
subjects = [Subject(f"subj0{i}") for i in range(1, 9)]
CreateSubmission(subjects, procedure=CLIPVisionFinalLayerPCA100LinearRegression()).run()

# Second test submission
CreateSubmission(subjects, procedure=CLIPTextFinalLayerPCA100LinearRegression()).run()

Third Submission - Vision Only Layer 7 PCA 200¶

For our third submission we used the 8-fold cross validation described above for the CLIP Vision Model, PCA200, Linear Regression. This submission resulted in our second highest score (49.318).

In [ ]:
class CLIPVisionLayer7PCA200LinearRegression(SubmissionProcedure):
    def __init__(self):
        self.description = "CLIP Vision Model, Transformer Layer 7, PCA 200, Linear Regression, Determined by 8-fold cross validation against all other layers."

    def run(self, subject: Subject) -> np.ndarray:
        """Run the model on a subject."""
        # Prepare data
        subject.create_dataloaders(processor=processor, batch_size=300)
        subject.load_neural_data()
        train_img_dataloader = subject.train_img_dataloader
        test_img_dataloader = subject.test_img_dataloader
        lh_fmri = subject.lh_fmri
        rh_fmri = subject.rh_fmri
        del subject # free up memory

        # Prepare feature extractor
        train_feature_extractor = CLIPFeatureExtractor(idxs=[7], last_hidden_layer=False, model=vis_model, dataloader=train_img_dataloader)
        test_feature_extractor = CLIPFeatureExtractor(idxs=[7], last_hidden_layer=False, model=vis_model, dataloader=test_img_dataloader)

        # Extract features
        train_feature_extractor.extract_raw_features_from_model()
        raw_train_features = train_feature_extractor.feature_dict["Transformer Layer 7"]
        del train_feature_extractor

        # Fit PCA
        pca = PCA(n_components=200)
        pca_transformed_train_features = pca.fit_transform(torch.tensor(raw_train_features).flatten(1).numpy())
        del raw_train_features

        # Fit linear regression
        lh_lin_reg = LinearRegression().fit(pca_transformed_train_features, lh_fmri)
        rh_lin_reg = LinearRegression().fit(pca_transformed_train_features, rh_fmri)
        del pca_transformed_train_features

        # Extract test features
        test_feature_extractor.extract_raw_features_from_model()
        raw_test_features = test_feature_extractor.feature_dict["Transformer Layer 7"]
        del test_feature_extractor

        # Transform test features
        pca_transformed_test_features = pca.transform(torch.tensor(raw_test_features).flatten(1).numpy())
        del raw_test_features
        
        # Predict
        lh_predictions = lh_lin_reg.predict(pca_transformed_test_features)
        rh_predictions = rh_lin_reg.predict(pca_transformed_test_features)
        return lh_predictions, rh_predictions

Executing Third Submission¶

In [ ]:
raise Exception("This code is not meant to be run. It is only meant to be used as a reference for the code used to generate the submission.")

# Third submission
subjects = [Subject(f"subj0{i}") for i in range(1, 9)]
CreateSubmission(subjects, procedure=CLIPVisionLayer7PCA200LinearRegression()).run()

Fourth Submission - Vision Only Layer 4 and 8 Combined PCA 200¶

For our fourth submission we used the 8-fold cross validation described above for the CLIP Vision Model, Layer 4 + 8, Normal & Combined PCA200, Linear Regression. This submission resulted in our highest score to date (50.541).

In [ ]:
class CLIPVisionL4L8PCA200LinearRegression(SubmissionProcedure):
    def __init__(self):
        self.description = "CLIP Vision Model, Transformer Layer 4 + Transformer Layer 8, PCA 200, Linear Regression, First test submission."

    def run(self, subject: Subject) -> np.ndarray:
        """Run the model on a subject."""
        # Prepare data
        subject.create_dataloaders(processor=processor, batch_size=400)
        subject.load_neural_data()
        train_img_dataloader = subject.train_img_dataloader
        test_img_dataloader = subject.test_img_dataloader
        lh_fmri = subject.lh_fmri
        rh_fmri = subject.rh_fmri
        del subject # free up memory

        # Prepare feature extractor
        train_feature_extractor = CLIPFeatureExtractor(idxs=[4,8], last_hidden_layer=False, model=vis_model, dataloader=train_img_dataloader)
        test_feature_extractor = CLIPFeatureExtractor(idxs=[4,8], last_hidden_layer=False, model=vis_model, dataloader=test_img_dataloader)

        # Extract features
        train_feature_extractor.extract_raw_features_from_model()
        raw_train_features = train_feature_extractor.feature_dict["Transformer Layer 4"]
        raw_train_features = np.hstack([raw_train_features, train_feature_extractor.feature_dict["Transformer Layer 8"]])
        del train_feature_extractor

        # Fit PCA
        pca = PCA(n_components=200)
        pca_transformed_train_features = pca.fit_transform(torch.tensor(raw_train_features).flatten(1).numpy())
        del raw_train_features

        # Fit linear regression
        lh_lin_reg = LinearRegression().fit(pca_transformed_train_features, lh_fmri)
        rh_lin_reg = LinearRegression().fit(pca_transformed_train_features, rh_fmri)
        del pca_transformed_train_features

        # Extract test features
        test_feature_extractor.extract_raw_features_from_model()
        raw_test_features = test_feature_extractor.feature_dict["Transformer Layer 4"]
        raw_test_features = np.hstack([raw_test_features, test_feature_extractor.feature_dict["Transformer Layer 8"]])
        del test_feature_extractor

        # Transform test features
        pca_transformed_test_features = pca.transform(torch.tensor(raw_test_features).flatten(1).numpy())
        del raw_test_features
        
        # Predict
        lh_predictions = lh_lin_reg.predict(pca_transformed_test_features)
        rh_predictions = rh_lin_reg.predict(pca_transformed_test_features)
        return lh_predictions, rh_predictions

Executing Fourth Submission¶

In [ ]:
raise Exception("This code is not meant to be run. It is only meant to be used as a reference for the code used to generate the submission.")

# Fourth submission
subjects = [Subject(f"subj0{i}") for i in range(1, 9)]
CreateSubmission(subjects, procedure=CLIPVisionL4L8PCA200LinearRegression()).run()

References¶

Articles¶

  • Allen EJ, St-Yves G, Wu Y, Breedlove JL, Prince JS, Dowdle LT, Nau M, Caron B, Pestilli F, Charest I, Hutchinson JB, Naselaris T, Kay K. 2022. A massive 7T fMRI dataset to bridge cognitive neuroscience and computational intelligence. Nature Neuroscience, 25(1):116–126. https://doi.org/10.1038/s41593-021-00962-x
  • Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint, arXiv:2010.11929. https://doi.org/10.48550/arXiv.2010.11929
  • Gifford AT, Lahner B, Saba-Sadiya S, Vilas MG, Lascelles A, Oliva A, Kay K, Roig G, Cichy RM. 2023. The Algonauts Project 2023 Challenge: How the Human Brain Makes Sense of Natural Scenes. arXiv preprint, arXiv:2301.03198. https://doi.org/10.48550/arXiv.2301.03198
  • Jozwik, K. M., Kriegeskorte, N., & Mur, M. (2016). Visual features as stepping stones toward semantics: Explaining object similarity in IT and perception with non-negative least squares. Neuropsychologia, 83, 201-226.
  • Marjieh, R., van Rijn, P., Sucholutsky, I., Sumers, T. R., Lee, H., Griffiths, T. L., & Jacoby, N. (2022). Words are all you need? Capturing human sensory similarity with textual descriptors. arXiv preprint arXiv:2206.04105.
  • Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common Objects in Context. In D. Fleet, T. Pajdla, B. Schiele, & T. Tuytelaars (Eds.), Computer Vision – ECCV 2014 (pp. 740–755). Springer International Publishing. https://doi.org/10.1007/978-3-319-10602-1_48
  • Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision (arXiv:2103.00020). arXiv. http://arxiv.org/abs/2103.00020

Packages:¶

  • Abraham, A., Pedregosa, F., Eickenberg, M., Gervais, P., Mueller, A., Kossaifi, J., ... & Varoquaux, G. (2014). Machine learning for neuroimaging with scikit-learn. Frontiers in Neuroinformatics, 14.
  • da Costa-Luis, (2019). tqdm: A Fast, Extensible Progress Meter for Python and CLI. Journal of Open Source Software, 4(37), 1277, https://doi.org/10.21105/joss.01277
  • Harris, C.R., Millman, K.J., van der Walt, S.J. et al.(2020). Array programming with NumPy. Nature, 585, 357–362. DOI: 10.1038/s41586-020-2649-2. (Publisher link).
  • J. D. Hunter. (2007). "Matplotlib: A 2D Graphics Environment", Computing in Science & Engineering, 9(3), 90-95.
  • McKinney, W., & others. (2010). Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference, 445, 51–56.
  • Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., … Chintala, S. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32 (pp. 8024–8035). Curran Associates, Inc. Retrieved from http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
  • Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., ... & Van Mulbregt, P. (2020). SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods, 17(3), 261-272.